IBM Power7 Optimization And Tuning Manual page 138

Peephole optimization
Peephole optimizations require a small context around the specific site in the code that is
problematic. The more important ones that FDPR performs are -las, -tlo, and -nop.
--load-after-store (-las): In recent Power Architectures, when a load instruction from
address A closely follows a store to that address, it can cause the load to be rejected. The
instruction is then retried in a slower mode, which produces a large performance penalty.
This behavior is also called Load-Hit-Store (LHS). With the -las optimization, the load is
pushed further from the store, thus avoiding the reject condition.
--toc-load-optimization (-tlo): The TOC (table of contents) is a data section in
programs where pointers are kept to avoid the lengthy address computation at run time.
Loading an address (a pointer) from the TOC is a costly operation, and FDPR can avoid
the load if the address is close enough to the TOC anchor (R2). In such cases, the load
from the TOC is replaced by an addi Rt,R2,offset, where R2+offset = loaded address. The
optimization is performed after data is reordered so that commonly accessed data is
placed closer to R2, increasing the potential of this optimization. A TOC is used in 32-bit
and 64-bit programs on AIX, and in 64-bit programs on Power Systems running Linux.
Linux 32-bit uses a GOT and this optimization is not relevant there.
--nop-removal (-nop): The compiler (or the linker) sometimes inserts no-operation (NOP)
instructions in various places to create some necessary space in the instruction stream.
The most common place is following a function call in code. Because the call might have
modified the TOC anchor register (R2), the NOP reserves a slot that can be replaced by a
load instruction that resets R2 to its correct value for the current function. Because FDPR
has a global view of the program, the optimization can remove the NOP if the called
function uses the same TOC (the TOC anchor is used in AIX and in Linux 64-bit).
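The "close enough" condition of the -tlo optimization can be illustrated with a small Python sketch. It assumes only that the displacement must fit the signed 16-bit immediate field of the addi instruction; the exact criteria that FDPR applies are not stated here, and the function name and sample addresses are hypothetical.

```python
# Range of the signed 16-bit immediate field of the Power addi instruction.
ADDI_MIN = -0x8000
ADDI_MAX = 0x7FFF


def toc_load_replacement(toc_anchor, target, rt="Rt"):
    """Return the addi that could replace a TOC load, or None if out of reach.

    toc_anchor: value held in R2 (hypothetical sample value in this sketch)
    target:     address that the TOC entry would have supplied
    """
    offset = target - toc_anchor
    if ADDI_MIN <= offset <= ADDI_MAX:
        # R2 + offset equals the loaded address, so no memory access is needed.
        return f"addi {rt},R2,{offset}"
    # Too far from the anchor: the load from the TOC has to stay.
    return None


if __name__ == "__main__":
    anchor = 0x10008000  # hypothetical value of R2
    print(toc_load_replacement(anchor, 0x10008010, "R9"))  # within reach
    print(toc_load_replacement(anchor, 0x10020000, "R9"))  # out of reach
```

Data that is placed closer to R2 is more likely to fall inside this range, which is why the optimization benefits from running after data reordering.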
Data reordering
The profile that is collected by FDPR provides important information about the running of
branch instructions, thus enabling efficient code reordering. The profile does not directly
indicate whether specific data objects should be placed one after the other. Nevertheless,
FDPR is able to infer such placement by using the collected profile.
The relevant options are:
--reorder-data (-RD): This optimization reorders data by placing pointers and data closer
to the TOC anchor, depending on their hotness. FDPR uses a heuristic where the hotness
is computed as the total count of basic blocks where the pointer to the data was retrieved
from the TOC.
--reduce-toc thres (-rt thres): The optimization removes from the TOC entries that are
colder than the threshold. Their access, if any, is replaced by computing the address (see
-tlo optimization in "Peephole optimization" on page 122). Typically, you use -rt 0, which
removes only the entries that are never accessed.
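The hotness heuristic and the threshold test can be sketched in Python as follows. This is an illustration under stated assumptions, not the FDPR implementation: "total count of basic blocks" is read as the sum of the profiled execution counts of the blocks that load the entry, an entry counts as "colder than the threshold" when its hotness is at or below thres (so that a threshold of 0 removes only entries that are never accessed), and all function names and sample data are hypothetical.

```python
from collections import defaultdict


def toc_entry_hotness(blocks):
    """Sum, per TOC entry, the execution counts of the blocks that load it."""
    hotness = defaultdict(int)
    for block in blocks:
        # Count each block once per entry, even if it loads the entry twice.
        for entry in set(block["toc_loads"]):
            hotness[entry] += block["count"]
    return dict(hotness)


def placement_order(toc_entries, hotness):
    """Order entries hottest first, that is, closest to the TOC anchor."""
    return sorted(toc_entries, key=lambda e: -hotness.get(e, 0))


def reduce_toc(toc_entries, hotness, thres):
    """Split entries into (kept, removed) using the assumed threshold test."""
    kept = [e for e in toc_entries if hotness.get(e, 0) > thres]
    removed = [e for e in toc_entries if hotness.get(e, 0) <= thres]
    return kept, removed


if __name__ == "__main__":
    # Hypothetical profile: execution count and TOC entries loaded per block.
    blocks = [
        {"count": 900, "toc_loads": ["buf", "table"]},
        {"count": 100, "toc_loads": ["table"]},
        {"count": 5, "toc_loads": ["errmsg"]},
    ]
    entries = ["errmsg", "unused", "buf", "table"]
    hot = toc_entry_hotness(blocks)
    print(placement_order(entries, hot))
    print(reduce_toc(entries, hot, 0))
```

In this sample, only the entry that no profiled block loads is removed at a threshold of 0, which matches the typical use of -rt 0 described above.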
Combination optimizations
FDPR has predefined optimization sets that provide a good starting point for
performance tuning:
-O: Performs code reordering (-RC) with branch prediction bit setting (-bp), branch folding
(-bf), and NOP instruction removal (-nop).
-O2: Adds to -O function de-virtualization (-pto), TOC-load optimization (-tlo), function
inlining (-isf 8), and some function optimizations (-hr, -see 0, and -kr).
POWER7 and POWER7+ Optimization and Tuning Guide
