IA-32 Intel® Architecture Optimization
Example 6-4
Using Prefetch Concatenation
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 32; jj+=8) {
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
}
Prefetch concatenation can bridge the execution pipeline bubbles at
the boundary between an inner loop and its associated outer loop.
By unrolling the last iteration out of the inner loop and specifying
the effective prefetch address for the data used in the following
iteration, the performance loss of memory de-pipelining can be
completely removed. Example 6-5 gives the rewritten code.
Example 6-5
Concatenation and Unrolling the Last Iteration of Inner Loop
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 24; jj+=8) { /* N-1 iterations */
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
    prefetch a[ii+1][0]
    computation a[ii][jj] /* Last iteration */
}
With this change, data prefetching is improved: only the first
iteration of the outer loop suffers any memory access latency penalty,
assuming the computation time is larger than the memory latency.
Inserting a prefetch of the first data element needed prior to entering the
nested loop computation would eliminate or reduce the start-up penalty
for the very first iteration of the outer loop. This uncomplicated
high-level code optimization can improve memory performance
significantly.
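
As a rough illustration, the sketch below writes Example 6-5 in C, together
with the start-up prefetch described above. It is not taken from the manual:
the PREFETCH macro (mapped to the GCC/Clang builtin __builtin_prefetch, and
to a no-op elsewhere), the compute_line helper, the double element type, the
figure of 8 elements per cache line, and the stride parameter are all
assumptions made for illustration.

```c
#include <stddef.h>

#define ROWS 100
#define COLS 32
#define LINE 8  /* assumption: 8 doubles fit in one 64-byte cache line */

/* Assumption: GCC/Clang builtin; other compilers get a harmless no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Hypothetical stand-in for "computation a[ii][jj]": sums one cache line. */
static double compute_line(const double *line)
{
    double s = 0.0;
    for (size_t k = 0; k < LINE; k++)
        s += line[k];
    return s;
}

/* a points to row 0; consecutive rows start stride elements apart
 * (stride >= COLS is assumed). */
double sum_concat(const double *a, size_t stride)
{
    double total = 0.0;

    PREFETCH(a);  /* start-up prefetch for the very first outer iteration */

    for (size_t ii = 0; ii < ROWS; ii++) {
        const double *row = a + ii * stride;
        size_t jj;

        for (jj = 0; jj < COLS - LINE; jj += LINE) {  /* N-1 iterations */
            PREFETCH(row + jj + LINE);                /* a[ii][jj+8] */
            total += compute_line(row + jj);          /* a[ii][jj]   */
        }

        /* Last iteration, unrolled: prefetch the first line of the next
         * row, if there is one, instead of running past this row. */
        if (ii + 1 < ROWS)
            PREFETCH(row + stride);                   /* a[ii+1][0] */
        total += compute_line(row + jj);              /* jj == COLS - LINE */
    }
    return total;
}
```

Here a would point to the first element of a 100-row array. With Intel
compilers, the _mm_prefetch intrinsic from xmmintrin.h could take the place
of the macro; the check on ii + 1 simply avoids prefetching beyond the final
row.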