Preload Loop Limitations - Intel PXA270 Optimization Manual

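estimate of N_evict is the number of bytes written per loop iteration divided by a half cache line size
(16 bytes). Cache overflow can be estimated by the number of cache lines transferred each iteration
and the number of expected loop iterations. N_evict and CPI can be estimated by profiling the code
using the performance monitor "cache write-back" event count.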
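The estimates described above are simple arithmetic, and a minimal C sketch may make them concrete. This is not code from the guide: the function names and the `cache_bytes` parameter are hypothetical, and the 32-byte line size is inferred from the 16-byte half line stated in the text.

```c
#include <stddef.h>

/* Half cache line size in bytes, as stated in the text. */
#define HALF_CACHE_LINE_BYTES 16u
/* Assumption: a full cache line is two half lines (32 bytes). */
#define CACHE_LINE_BYTES (2u * HALF_CACHE_LINE_BYTES)

/* N_evict estimate: bytes written per loop iteration divided by the
 * half cache line size. Hypothetical helper name. */
double estimate_n_evict(size_t bytes_written_per_iter)
{
    return (double)bytes_written_per_iter / (double)HALF_CACHE_LINE_BYTES;
}

/* Total cache lines moved by the whole loop. */
size_t total_lines_transferred(size_t lines_per_iter, size_t iterations)
{
    return lines_per_iter * iterations;
}

/* Rough overflow check: nonzero when the loop moves more data than a
 * cache of cache_bytes can hold. cache_bytes is supplied by the caller. */
int may_overflow_cache(size_t lines_per_iter, size_t iterations,
                       size_t cache_bytes)
{
    return total_lines_transferred(lines_per_iter, iterations)
               * CACHE_LINE_BYTES > cache_bytes;
}
```

For instance, a loop that writes one full 32-byte line per iteration gives `estimate_n_evict(32)` equal to 2. Profiling with the "cache write-back" event count remains the more reliable source for the real value.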
If the preload address is not aligned with a cache-line boundary, CWF latency should be considered in
the equation. A CWF read can have higher latency for the cache-line load than a non-CWF read, so
cache-line-aligned addresses are recommended for preloads.
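One way to follow the alignment recommendation is to round a preload address down to the start of its cache line before issuing the preload. The sketch below assumes a 32-byte line (twice the 16-byte half line mentioned earlier); the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 32-byte cache line (two 16-byte half lines). */
#define CACHE_LINE_BYTES 32u

/* Round an address down to the start of the cache line containing it. */
const void *align_to_cache_line(const void *p)
{
    uintptr_t mask = ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
    return (const void *)((uintptr_t)p & mask);
}

/* Nonzero when the address already sits on a cache-line boundary. */
int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p & (uintptr_t)(CACHE_LINE_BYTES - 1u)) == 0;
}
```

Because the aligned address lies in the same cache line as the original one, preloading it brings in the same data.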
Using the memory latency for the PXA27x processor at 200-MHz run mode and 100-MHz
SDRAM, the preload distance comes out to three cache lines. This is a rule of thumb for the
PXA27x processor for a loop which consumes and produces one cache line.
Correct scheduling of the preload loop can help large memory-based operations such as
memory-to-memory copy, page copy, and video operations that load particular image blocks.
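As an illustration of the three-cache-line rule of thumb, here is a hedged sketch of a word copy loop that preloads three lines ahead of the line being consumed. It assumes a GCC or Clang toolchain, where `__builtin_prefetch` is available as a prefetch hint, and a 32-byte cache line. The function name and constants are hypothetical, and the best distance for a real loop should be confirmed by profiling.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES       32u  /* assumption: 2 x 16-byte half line */
#define PRELOAD_DISTANCE_LINES 3u   /* rule of thumb quoted in the text  */
#define WORDS_PER_LINE (CACHE_LINE_BYTES / sizeof(uint32_t))

/* Copy n_words 32-bit words, one cache line per main-loop iteration,
 * issuing a preload hint three cache lines ahead of the current line. */
void copy_with_preload(uint32_t *dst, const uint32_t *src, size_t n_words)
{
    size_t i = 0;

    for (; i + WORDS_PER_LINE <= n_words; i += WORDS_PER_LINE) {
        /* Compute the hint address as an integer so that no out-of-range
         * pointer is formed; the hint itself does not fault. */
        uintptr_t ahead = (uintptr_t)(src + i)
                        + PRELOAD_DISTANCE_LINES * CACHE_LINE_BYTES;
        __builtin_prefetch((const void *)ahead);

        for (size_t j = 0; j < WORDS_PER_LINE; ++j)
            dst[i + j] = src[i + j];
    }

    /* Copy any remaining words that do not fill a whole line. */
    for (; i < n_words; ++i)
        dst[i] = src[i];
}
```

The hint is issued once per cache line rather than once per word, so the loop does not spend extra preload instructions on a line that is already in flight.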
5.1.1.2 Preload Loop Limitations

It is not always advantageous to add preloads to a loop. Loop characteristics that limit the value of
adding preloads are discussed below.
5.1.1.2.1 Preload Limitations: Throughput Bound vs. Latency Bound
The worst case is a loop that is bound by memory throughput. Such a loop does not benefit from
preloading because all the system resources for transferring data are quickly allocated, and no
preload instructions can be executed without impacting (non-preload) memory loads.
However, if the application is bound by memory latency, preloading can effectively hide that
latency. Applications that manipulate large amounts of data, such as graphics and video
applications, can benefit greatly from preloading.
5.1.1.2.2 Preload Limitations: Low Number of Iterations
A low number of loop iterations may completely negate the advantage of preloading. A loop with a
small, fixed number of iterations may be faster if it is completely unrolled rather than scheduled
with preload instructions.
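To show what complete unrolling looks like for a small fixed trip count, the sketch below sums four values with a loop and with the loop fully unrolled. The function names and the four-element size are chosen only for illustration; whether the unrolled form is actually faster on a given build should be checked by measurement.

```c
#include <stdint.h>

/* Loop form: four iterations, each paying loop-control overhead. */
int32_t sum4_loop(const int32_t *a)
{
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += a[i];
    return sum;
}

/* Fully unrolled form: no loop counter or branch, and no preload
 * scheduling to fit into so few iterations. */
int32_t sum4_unrolled(const int32_t *a)
{
    return a[0] + a[1] + a[2] + a[3];
}
```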
5.1.1.2.3 Preload Limitations: Bandwidth Consumption
Overuse of preloads can usurp resources and degrade performance. The processor stalls once bus
traffic requests exceed the system resource capacity. Intel XScale® Microarchitecture data transfer
resources are:
4 fill buffers
4 pending buffers
8 half-cache-line write buffers
SDRAM resources are typically:
4 memory banks
1 page buffer per bank referencing a 4K address range
4 transfer request buffers
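The resource counts listed above suggest a rough budget for how many preloads a loop can keep in flight. The sketch below is a deliberately simplified model, not a description of the hardware: it assumes that each outstanding preload or load miss occupies one of the four fill buffers, and the function names are hypothetical.

```c
/* Fill buffer count taken from the list in the text. */
#define FILL_BUFFERS 4u

/* Nonzero when the requested number of in-flight transfers would exceed
 * the fill buffers, which is the condition under which the simplified
 * model expects a stall. */
int exceeds_fill_buffers(unsigned in_flight_preloads,
                         unsigned in_flight_load_misses)
{
    return in_flight_preloads + in_flight_load_misses > FILL_BUFFERS;
}

/* Preloads that can still be issued once demand load misses have taken
 * their share of the fill buffers. */
unsigned max_preloads_in_flight(unsigned in_flight_load_misses)
{
    if (in_flight_load_misses >= FILL_BUFFERS)
        return 0u;
    return FILL_BUFFERS - in_flight_load_misses;
}
```

Under this model, a loop that already has one demand miss outstanding has room for at most three preloads in flight.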
Intel® PXA27x Processor Family Optimization Guide
High Level Language Optimization
