Coding Technique With Preload - Intel PXA270 Optimization Manual

High Level Language Optimization
The following describes how these resources interact. A fill buffer is allocated for each cache read
miss. A fill buffer and a pending buffer are allocated for each cache write miss if the memory space
is marked as write-allocate. A subsequent read to the same cache line does not require a new fill
buffer, but it does require a pending buffer; a subsequent write also requires a new pending buffer.
A fill buffer is also allocated for each read from non-cached memory, and a write buffer is needed
for each write to non-cached memory that does not coalesce. Consequently, an STM instruction
listing eight registers and referencing non-cached memory uses eight write buffers if the stores do
not coalesce, and two write buffers if they do. A cache eviction requires a write buffer for each
dirty bit set in the cache line. The preload instruction requires a fill buffer for each cache line and
zero, one, or two write buffers for an eviction.
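As a rough illustration of this accounting, the write-buffer cost of an STM to non-cached memory can be tallied as follows. The helper function and its 16-byte coalescing granularity are assumptions for this sketch, not taken from the manual; the granularity is chosen to reproduce the eight-buffer and two-buffer figures quoted above:

```c
/* Hypothetical helper (not from the manual) tallying write-buffer use
 * for an STM to non-cached memory, per the rules stated above. */
static int stm_write_buffers(int num_registers, int coalescing)
{
    if (!coalescing)
        return num_registers;  /* one write buffer per store */
    /* Assume a 16-byte coalescing granularity (four 4-byte registers
     * per buffer), which reproduces the eight-register/two-buffer
     * figure quoted above. */
    return (num_registers * 4 + 15) / 16;
}
```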
When adding preload instructions, use caution to ensure that the combination of preloads and
instruction fetches does not exceed the available system resource capacity described above;
otherwise performance is degraded instead of improved. Intersperse preload operations throughout
the calculations so that memory bus traffic can flow freely, and keep the number of preloads to the
minimum necessary.
5.1.1.3    Coding Technique with Preload

Since preload is a powerful optimization technique, preloading opportunities should be exploited
during code development in the high-level language. The preload instruction can be implemented
as a C-callable function and used at different places throughout the C code. The developer may
choose to implement two types of routines: one that preloads a single cache line, and one that
preloads multiple cache lines. The data-usage pattern (for example, linear array striding versus 2-D
array striding) can influence the choice of preload scheme. Keep in mind, however, that only four
outstanding preloads are allowed, so excessive use must be avoided.
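One possible shape for such C-callable routines is sketched below. The names and the two-routine split are illustrative assumptions; `__builtin_prefetch` is the GCC/Clang intrinsic, which on PXA27x-class ARM targets typically lowers to a PLD instruction (and compiles to nothing on targets without a prefetch instruction, so the code stays portable). The 32-byte line size matches the PXA27x data cache.

```c
#include <stddef.h>

#define CACHE_LINE 32  /* PXA27x data-cache line size in bytes */

/* Preload the single cache line containing addr. */
static inline void preload_line(const void *addr)
{
    __builtin_prefetch(addr, 0 /* read */, 0);
}

/* Preload nlines consecutive cache lines starting at addr.
 * Keep nlines small: only four outstanding preloads are allowed. */
static inline void preload_lines(const void *addr, size_t nlines)
{
    const char *p = (const char *)addr;
    for (size_t i = 0; i < nlines; i++)
        __builtin_prefetch(p + i * CACHE_LINE, 0, 0);
}
```

A linear array walk might call `preload_line` one line ahead of the current element, while a routine striding through rows of a 2-D array might call `preload_lines` once per row.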
5.1.1.3.1    Coding Technique: Unrolling With Preload
When iterating through a loop, data-transfer latency can be hidden by preloading one or more
iterations ahead. This approach has an unwanted side effect: the final iterations of the loop load
useless data into the cache, polluting the cache, increasing bus traffic, and possibly evicting
valuable temporal data. The problem can be resolved by preload unrolling. For example:
for (i = 0; i < NMAX; i++)
{
    prefetch(data[i+2]);
    sum += data[i];
}
Here, iterations i = NMAX-2 and i = NMAX-1 preload superfluous data. The problem can be
avoided by unrolling the end of the loop:
for (i = 0; i < NMAX-2; i++)
{
    prefetch(data[i+2]);
    sum += data[i];
}
sum += data[NMAX-2];
sum += data[NMAX-1];
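Put together as a self-contained sketch, the unrolled loop produces the same sum while never preloading past data[NMAX-1]. Here `prefetch` is a hypothetical no-op stand-in for a real preload routine, so the control flow can be checked on any host:

```c
/* Hypothetical stand-in for a C-callable preload routine; a no-op
 * here so the loop logic can be verified on any host. */
#define prefetch(x) ((void)0)

/* Sum with the preload loop unrolled: the main loop stops two
 * iterations early, so prefetch(data[i+2]) never reaches past
 * data[nmax-1]; the last two elements are added outside the loop. */
static int sum_with_preload(const int *data, int nmax)
{
    int sum = 0;
    for (int i = 0; i < nmax - 2; i++) {
        prefetch(data[i + 2]);
        sum += data[i];
    }
    sum += data[nmax - 2];
    sum += data[nmax - 1];
    return sum;
}
```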
This manual is also suitable for the PXA271, PXA272, and PXA273 processors.