Minimize Number Of Software Prefetches; Figure 6-4 Prefetch And Loop Unrolling - Intel ARCHITECTURE IA-32 Reference Manual

Minimize Number of Software Prefetches

Prefetch instructions are not completely free in terms of bus cycles,
machine cycles and resources, even though they require minimal clocks
and memory bandwidth.
Excessive prefetching may lead to performance penalties because of issue
penalties in the front end of the machine and/or resource contention in
the memory subsystem. This effect may be severe when the target loops
are small and/or issue-bound.
One approach to solving the excessive prefetching issue is to unroll
and/or software-pipeline the loops to reduce the number of prefetches
required. Figure 6-4 presents a code example that implements prefetch
and unrolls the loop to remove the redundant prefetch instructions, that
is, those whose addresses hit cache lines already requested by previously
issued prefetches. In this particular example, unrolling the original
loop once saves six prefetch instructions and nine instructions for
conditional jumps in every other iteration.

Figure 6-4 Prefetch and Loop Unrolling

Original loop:

top_loop:
    prefetchnta [edx+esi+32]
    prefetchnta [edx*4+esi+32]
    . . . . .
    movaps xmm1, [edx+esi]
    movaps xmm2, [edx*4+esi]
    . . . . .
    add esi, 16
    cmp esi, ecx
    jl top_loop

Unrolled iteration:

top_loop:
    prefetchnta [edx+esi+128]
    prefetchnta [edx*4+esi+128]
    . . . . .
    movaps xmm1, [edx+esi]
    movaps xmm2, [edx*4+esi]
    . . . . .
    movaps xmm1, [edx+esi+16]
    movaps xmm2, [edx*4+esi+16]
    . . . . .
    movaps xmm1, [edx+esi+96]
    movaps xmm2, [edx*4+esi+96]
    . . . . .
    . . . . .
    add esi, 128
    cmp esi, ecx
    jl top_loop
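
The same idea as Figure 6-4 can be sketched in C with the SSE prefetch intrinsic. This is an illustrative sketch, not code from the manual: the function names, the dot-product workload, and the prefetches_issued counter are hypothetical, and the 128-byte block size is an assumption taken from the stride used in the unrolled loop of the figure. The bounds check before each prefetch is also an addition, made so the C code never forms an out-of-range pointer. The sketch assumes n is a multiple of 32 floats.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p)) /* assumption: no-op fallback off x86 */
#endif

#define STEP  4   /* floats per 16-byte step (one movaps-sized load) */
#define BLOCK 32  /* floats per 128-byte block (assumed prefetch granule) */

/* Hypothetical instrumentation so the two loop shapes can be compared. */
static unsigned long prefetches_issued;

/* Issue one non-temporal prefetch per array, only for in-range indices. */
static void prefetch_pair(const float *a, const float *b, size_t i, size_t n)
{
    if (i < n) {
        PREFETCH_NTA(a + i);
        PREFETCH_NTA(b + i);
        prefetches_issued += 2;
    }
}

/* Original shape: a prefetch pair on every 16-byte step. */
float dot_per_step(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i += STEP) {
        prefetch_pair(a, b, i + 8, n);      /* 32 bytes ahead */
        for (size_t k = 0; k < STEP; k++)
            sum += a[i + k] * b[i + k];
    }
    return sum;
}

/* Unrolled shape: one prefetch pair covers a whole 128-byte block. */
float dot_per_block(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i += BLOCK) {
        prefetch_pair(a, b, i + BLOCK, n);  /* 128 bytes ahead */
        for (size_t k = 0; k < BLOCK; k++)
            sum += a[i + k] * b[i + k];
    }
    return sum;
}
```

Both functions visit the elements in the same order and return the same sum; only the number of prefetches and loop-control branches differs. Whether the reduced count actually helps depends on the target processor and on how issue-bound the loop is, so it is worth measuring both shapes.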
