Memory Page Thrashing; Prefetch Considerations; Prefetch Distances; Prefetch Loop Scheduling - Intel PXA255 User Manual

XScale Microarchitecture
Optimization Guide
A.4.3.2. Memory Page Thrashing

Memory page thrashing occurs because of the nature of SDRAM. SDRAMs are typically divided
into multiple banks, and each bank can have one selected (open) page; for current memory
components the page size is commonly 4 KB. Memory latency for an access to the currently
selected page is 2 to 3 bus clocks. Thrashing occurs when successive memory accesses
within the same memory bank touch different pages: each page change adds 3 to 4 bus
clock cycles to the memory latency. This added delay lengthens the required prefetch distance
correspondingly, making it more difficult to hide memory access latency. This type of thrashing
can be resolved by placing the conflicting data structures in different memory banks, or by
interleaving the data structures so that related data resides within the same memory page. It is
also extremely important to ensure that instruction and data sections are in different memory
banks; otherwise they will continually thrash the memory page selection.
A.4.4 Prefetch Considerations

The Intel® XScale™ core has a true prefetch load instruction (PLD). The purpose of this
instruction is to preload data into the data and mini-data caches. Data prefetching hides
memory transfer latency while the processor continues to execute instructions. Prefetch is
important to both compiled and assembly code because judicious use of the prefetch instruction
can greatly improve throughput on the Intel® XScale™ core. Data prefetch can be
applied not only to loops but also to any data references within a block of code. Prefetch also
applies to data writes when the memory region is configured as write-allocate.
The Intel® XScale™ core prefetch load instruction is a true prefetch instruction because the load
destination is the data or mini-data cache rather than a register. Compilers for processors that have
data caches but do not support prefetch sometimes use a load instruction to preload the data cache.
That technique has the disadvantage of consuming a register for each load and requiring additional
registers for subsequent preloads, thus increasing register pressure. By contrast, the prefetch
instruction can reduce register pressure instead of increasing it.
The prefetch load is a hint instruction and does not guarantee that the data will be loaded.
Whenever the load would cause a fault or a table walk, the processor ignores the prefetch
instruction, suppresses the fault or table walk, and continues with the next instruction. This is
particularly advantageous where a linked list or recursive data structure is terminated by a NULL
pointer: prefetching through the NULL pointer does not fault or interrupt program flow.
A.4.4.1. Prefetch Distances

Scheduling the prefetch instruction requires some understanding of the system latencies and
system resources that determine when to use it. For the PXA255 processor, a cache line fill of
8 words from external memory takes more than 10 memory clocks, depending on external RAM
speed and system timing configuration. With the core running faster than memory, data from
external memory may take many tens of core clocks to arrive, especially when the referenced
word is the last in the cache line. Considerable savings are therefore possible when prefetch
loads are issued many instructions before the data is referenced.
A.4.4.2. Prefetch Loop Scheduling

When adding prefetch to a loop which operates on arrays, it may be advantageous to prefetch ahead
one, two, or more iterations. The data for future iterations is located in memory by a fixed offset
from the data for the current iteration. This makes it easy to predict where to fetch the data. The
number of iterations to prefetch ahead is referred to as the prefetch scheduling distance.
Intel® XScale™ Microarchitecture User's Manual
