Preload Loop Limitations - Intel PXA270 Optimization Manual

Pxa27x processor family

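[...] estimate of N_evict is the number of bytes written per loop iteration divided by a half cache line size (16 bytes). Cache overflow can be estimated by the number of cache lines transferred each iteration and the number of expected loop iterations. N_evict and CPI can be estimated by profiling the code using the performance monitor "cache write-back" event count.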
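The two estimates described above can be sketched as simple arithmetic. This is only an illustration: the helper names are hypothetical, rounding up a partial half line is an assumption, and treating total traffic as lines per iteration multiplied by iterations is one plausible reading of the text, not a formula taken from the guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half cache line size, as stated in the text (16 bytes). */
#define HALF_CACHE_LINE_BYTES 16u

/* Hypothetical helper: N_evict as bytes written per loop iteration divided
 * by the half-cache-line size. Rounding up is an assumption of this sketch,
 * so that a partially written half line still counts as one eviction. */
unsigned estimate_n_evict(size_t bytes_written_per_iter)
{
    return (unsigned)((bytes_written_per_iter + HALF_CACHE_LINE_BYTES - 1u)
                      / HALF_CACHE_LINE_BYTES);
}

/* Hypothetical helper: total cache lines moved over the whole loop, a rough
 * figure to compare against the cache capacity when judging overflow. */
size_t estimate_lines_transferred(size_t lines_per_iter, size_t iterations)
{
    /* Guard against overflow of the product. */
    assert(iterations == 0 || lines_per_iter <= SIZE_MAX / iterations);
    return lines_per_iter * iterations;
}
```

For a loop that writes one full 32-byte cache line per iteration, estimate_n_evict(32) gives 2. In practice the text suggests refining such figures with the performance monitor "cache write-back" event count rather than relying on static estimates alone.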

If the preload address is not aligned with a cache-line boundary, critical-word-first (CWF) latency should be considered in the equation. A CWF access can have higher latency for the cache-line load than a non-CWF read, so cache-line-aligned addresses are recommended for preloads.

Using the memory latency for the PXA27x processor at 200-MHz run mode with 100-MHz SDRAM, the preload distance comes out to three cache lines. This is a rule of thumb for the PXA27x processor for a loop that consumes and produces one cache line.
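The three-line rule of thumb can be sketched as a copy loop that preloads the source three cache lines ahead of the line being copied. The sketch assumes a 32-byte cache line (two 16-byte half lines), cache-line-aligned buffers, and a GCC-compatible compiler providing `__builtin_prefetch`, which on ARM targets is typically lowered to a PLD instruction. The function name and constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_BYTES 32u  /* assumed: two 16-byte half cache lines */
#define PRELOAD_DISTANCE  3u  /* cache lines ahead, per the rule of thumb */

/* Copy n_bytes from src to dst one cache line per iteration, preloading
 * the source PRELOAD_DISTANCE lines ahead. Buffers are assumed to be
 * cache-line aligned and n_bytes a multiple of the line size. */
void copy_with_preload(uint8_t *dst, const uint8_t *src, size_t n_bytes)
{
    assert(n_bytes % CACHE_LINE_BYTES == 0);
    size_t n_lines = n_bytes / CACHE_LINE_BYTES;

    for (size_t line = 0; line < n_lines; ++line) {
        size_t ahead = line + PRELOAD_DISTANCE;
        if (ahead < n_lines) {
            /* A hint only: it never changes the result of the copy. */
            __builtin_prefetch(src + ahead * CACHE_LINE_BYTES, 0, 3);
        }
        memcpy(dst + line * CACHE_LINE_BYTES,
               src + line * CACHE_LINE_BYTES,
               CACHE_LINE_BYTES);
    }
}
```

The distance of three applies to the clock configuration quoted above. For other core or SDRAM frequencies it would need to be re-derived or measured.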

Correct scheduling of the preload loop can help large memory-based operations such as memory-to-memory copy, page copy, and video-related operations that load particular image blocks.

5.1.1.2 Preload Loop Limitations

It is not always advantageous to add preloads to a loop. Loop characteristics that limit the value of adding preloads are discussed below.

5.1.1.2.1 Preload Limitations: Throughput Bound vs. Latency Bound

The worst case is a loop that is bounded by memory throughput. Such a loop does not benefit from preloading because all the system resources for transferring data are quickly allocated, and no preload instruction can be executed without impacting (non-preload) memory loads.

However, if the application is bounded by memory latency, preloading can effectively hide that latency. Applications that manipulate large amounts of data, such as graphics and video applications, can benefit greatly from preloading.
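One rough way to tell the two cases apart is to compare, per loop iteration, the cycles the memory system needs to move the data with the cycles the core spends computing on it. The classifier below is a hypothetical heuristic, not a method from the guide, and both inputs are assumed to come from profiling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical heuristic: treat a loop as throughput bound when moving one
 * iteration's data takes at least as many cycles as computing on it, so the
 * memory system is already saturated and preloads have nothing to overlap. */
bool is_throughput_bound(unsigned transfer_cycles_per_iter,
                         unsigned compute_cycles_per_iter)
{
    return transfer_cycles_per_iter >= compute_cycles_per_iter;
}

/* Preloading is expected to help only when there is spare compute time in
 * which an outstanding memory access can complete. */
bool preload_likely_helps(unsigned transfer_cycles_per_iter,
                          unsigned compute_cycles_per_iter)
{
    assert(compute_cycles_per_iter > 0);
    return !is_throughput_bound(transfer_cycles_per_iter,
                                compute_cycles_per_iter);
}
```

A measured before-and-after comparison remains the more reliable test. The heuristic only indicates whether adding preloads is worth trying.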

5.1.1.2.2 Preload Limitations: Low Number of Iterations

Loops with a low number of iterations may completely negate the advantage of preloading. A loop with a small, fixed number of iterations may be faster if it is completely unrolled than if preload instructions are scheduled into it.
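As an illustration of the unrolling alternative, consider a hypothetical four-element dot product. With only four iterations there is no room to schedule a preload far enough ahead, so the sketch below simply removes the loop. The function names are assumptions made for this example.

```c
#include <stdint.h>

/* Rolled version: the loop overhead is significant for four iterations,
 * and a preload issued here could not run far enough ahead to help. */
int32_t dot4_loop(const int16_t *a, const int16_t *b)
{
    int32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

/* Fully unrolled version: no branch and no loop counter, and the loads
 * and multiplies can be freely interleaved by the compiler. */
int32_t dot4_unrolled(const int16_t *a, const int16_t *b)
{
    return (int32_t)a[0] * b[0]
         + (int32_t)a[1] * b[1]
         + (int32_t)a[2] * b[2]
         + (int32_t)a[3] * b[3];
}
```

Both functions return the same result. Whether the unrolled form is actually faster depends on the compiler and surrounding code, so it should be confirmed by measurement.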

5.1.1.2.3 Preload Limitations: Bandwidth Consumption

Overuse of preloads can usurp resources and degrade performance. This happens because once the bus traffic requests exceed the system resource capacity, the processor stalls. Intel XScale® Microarchitecture data transfer resources are:

• 4 fill buffers
• 4 pending buffers
• 8 half-cache-line write buffer entries

SDRAM resources are typically:

• 4 memory banks
• 1 page buffer per bank referencing a 4 KB address range
• 4 transfer request buffers
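One way to avoid overusing preloads is to issue a single preload per cache line instead of one per element, and to keep the look-ahead small enough that the number of lines in flight stays below the fill buffer count listed above. The sketch below assumes a 32-byte cache line, a cache-line-aligned array, and a GCC-compatible compiler providing `__builtin_prefetch`. The function name and constants are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u                                   /* assumed */
#define INTS_PER_LINE    (CACHE_LINE_BYTES / sizeof(int32_t))  /* 8 */
#define PRELOAD_DISTANCE 3u  /* lines ahead; kept below the 4 fill buffers */

/* Sum an array while issuing at most one preload per cache line. The array
 * is assumed to start on a cache-line boundary, so every INTS_PER_LINE-th
 * element begins a new line. */
int64_t sum_with_sparse_preload(const int32_t *data, size_t n)
{
    int64_t total = 0;

    for (size_t i = 0; i < n; ++i) {
        if (i % INTS_PER_LINE == 0) {
            size_t ahead = i + PRELOAD_DISTANCE * INTS_PER_LINE;
            if (ahead < n) {
                /* One hint per line; repeating it for the other seven
                 * elements would only add bus requests. */
                __builtin_prefetch(&data[ahead], 0, 3);
            }
        }
        total += data[i];
    }
    return total;
}
```

Issuing the preload on every element would request the same line eight times, which is the kind of redundant traffic this subsection warns about.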

Intel® PXA27x Processor Family Optimization Guide


High Level Language Optimization

