A.4.4
Prefetch Considerations
The Intel XScale® core has a true prefetch load instruction (PLD). The purpose of this instruction
is to preload data into the data and mini-data caches. Data prefetching hides memory transfer
latency while the processor continues to execute instructions. Prefetch matters to both
compiler-generated and hand-written assembly code because judicious use of the prefetch instruction
can significantly improve the throughput of the core. Data prefetch can be applied not only to loops
but also to any data references within a block of code. Prefetch also applies to data writes when
the memory type is configured as write allocate.
The Intel XScale® core prefetch load instruction is a true prefetch instruction because the load
destination is the data or mini-data cache and not a register. Compilers for processors which have
data caches, but do not support prefetch, sometimes use a load instruction to preload the data cache.
This technique has the disadvantages of using a register to load data and requiring additional
registers for subsequent preloads and thus increasing register pressure. By contrast, the prefetch
can be used to reduce register pressure instead of increasing it.
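The contrast between the two preload styles can be sketched in C. This is an illustration only: the function names are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin, which is assumed here to be lowered to a prefetch instruction such as PLD on targets that provide one.

```c
/* Preload by loading: the value is pulled through a register and then
 * discarded. The volatile cast keeps the compiler from deleting the load.
 * Unlike a hint, this load faults if p is not a valid address. */
void preload_by_load(const int *p)
{
    (void)*(const volatile int *)p;
}

/* Preload by hint: no destination register is needed, so the request
 * adds no register pressure. Arguments: address, 0 = read access,
 * 3 = high temporal locality (keep the line in the cache). */
void preload_by_hint(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}
```

Neither function changes the data it touches; the difference lies in the register cost and in how an invalid address is handled.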
The prefetch load is a hint instruction and does not guarantee that the data will be loaded.
Whenever the load would cause a fault or a table walk, then the processor will ignore the prefetch
instruction, the fault or table walk, and continue processing the next instruction. This is particularly
advantageous in the case where a linked list or recursive data structure is terminated by a NULL
pointer. Prefetching the NULL pointer will not cause a fault or interrupt program flow.
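The linked-list case above can be sketched in C as follows. This is a minimal sketch, not code from the manual: `struct node` and `sum_list` are hypothetical names, and `__builtin_prefetch` is the GCC/Clang builtin, assumed here to map to the prefetch instruction on the target.

```c
#include <stddef.h>

/* Hypothetical node type, used only for illustration. */
struct node {
    struct node *next;
    int value;
};

/* Sum a NULL-terminated list, hinting the next node into the cache
 * while the current node is being processed. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        /* n->next is NULL on the last node. Because the prefetch is
         * only a hint, no NULL check is needed before issuing it. */
        __builtin_prefetch(n->next, 0 /* read */, 3 /* keep in cache */);
        total += n->value;
    }
    return total;
}
```

The loop needs no extra branch to guard the prefetch, which is the advantage the paragraph above describes for NULL-terminated structures.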
A.4.4.1.
Prefetch Distances
Scheduling the prefetch instruction requires understanding the system latency times and system
resources which affect when to use the prefetch instruction. Refer to the Intel XScale® core
implementation option section of the ASSP architecture specification for more information.
A.4.4.2.
Prefetch Loop Scheduling
When adding prefetch to a loop which operates on arrays, it may be advantageous to prefetch ahead
one, two, or more iterations. The data for future iterations is located in memory by a fixed offset
from the data for the current iteration. This makes it easy to predict where to fetch the data. The
number of iterations to prefetch ahead is referred to as the prefetch scheduling distance. Refer to
the Intel XScale® core implementation option section of the ASSP architecture specification for
more information.
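A loop that prefetches a fixed number of iterations ahead might look like the sketch below. The function name `sum_array` and the distance of 8 iterations are assumptions chosen for illustration, not values from the manual; the right distance depends on the system latencies described in the referenced specification. `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <stddef.h>

/* Assumed prefetch scheduling distance, in loop iterations.
 * Illustrative value only; tune it for the actual system. */
#define PREFETCH_DISTANCE 8u

long sum_array(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* The data for iteration i + PREFETCH_DISTANCE sits at a fixed
         * offset from a[i], so its address is known in advance. The
         * bounds check keeps the address expression inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        total += a[i];
    }
    return total;
}
```

Issuing a prefetch on every iteration is redundant when several elements share a cache line. A tuned version would likely issue one prefetch per line, for example by unrolling the loop.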
A.4.4.3.
Prefetch Loop Limitations
It is not always advantageous to add prefetch to a loop. Loop characteristics that limit the value of
prefetch are discussed below.
A.4.4.4.
Compute vs. Data Bus Bound
At one extreme, a loop that is data bus bound will not benefit from prefetch because all the
system resources to transfer data are quickly allocated and there are no instructions that can
profitably be executed. At the other extreme, compute bound loops allow complete hiding
of all data transfer latencies.
January, 2004
Intel XScale® Core Developer's Manual
Optimization Guide