Intel PXA270 Optimization Manual page 100

PXA27x Processor Family


High Level Language Optimization

Consider this code sample:

    add   r1, r1, #1
    ; Sequence of instructions using r2, but leave r3 unchanged.
    ldr   r2, [r3]
    add   r3, r3, #4
    mov   r4, r3
    sub   r2, r2, #1

The sub instruction above would stall if the data being loaded misses the cache. These stalls can be avoided by using a PLD instruction as:

    pld   [r3]
    add   r1, r1, #1
    ; Sequence of instructions using r2, but leave r3 unchanged.
    ldr   r2, [r3]
    add   r3, r3, #4
    mov   r4, r3
    sub   r2, r2, #1

For most cases, optimizing for the external memory latency also satisfies the requirements for the internal memory latency.

5.1.1.1.2 Preload Loop Scheduling

When adding preload instructions to a loop which operates on arrays, preload ahead one, two, or more iterations. The data for future iterations is located in memory a fixed offset from the data for the current iteration. This makes it easy to predict where to fetch the data. The number of iterations to preload ahead is referred to as the preload scheduling distance (PSD). For the Intel XScale® Microarchitecture this can be calculated as:

$$\mathrm{PSD} = \operatorname{floor}\left(\frac{N_{linexfer} \times N_{pref} + N_{hwlinexfer} \times N_{evict}}{\mathrm{CPI} \times N_{inst}}\right)$$

Where:

$N_{linexfer}$: The number of core clocks required to transfer one complete cache line.

$N_{pref}$: The number of cache lines to be pre-loaded for both reading and writing.

$N_{hwlinexfer}$: The number of core clocks required to write half a cache line (as if) only one of the cache line dirty bits were set when a line eviction occurred.

$N_{evict}$: The number of cache half line evictions caused by the loop.

$N_{inst}$: The number of instructions executed in one iteration of the loop.

CPI: This is the average number of core clocks per instruction (of the instructions within the loop).

PSD calculated in the above equation is a good initial estimation, but may not be the optimum scheduling distance. Estimating $N_{evict}$ is difficult from static code. However, if the operational data uses the mini-data cache and if the loop operations overflow the mini-data cache, then a first order
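The preload scheduling distance formula in this section can be tried out in C. The following is a minimal sketch and is not taken from the guide: the function names preload_scheduling_distance and sum_with_preload, their parameters, and the sample numbers in the note after the code are illustrative assumptions. __builtin_prefetch is the GCC/Clang builtin, which those compilers usually turn into a PLD on ARM targets that support it. The first function evaluates the PSD expression with integer arithmetic; the second applies the result to an array loop by preloading the element psd iterations ahead.

```c
#include <stddef.h>

/* PSD = floor((N_linexfer * N_pref + N_hwlinexfer * N_evict) / (CPI * N_inst)).
 * CPI is passed as the fraction cpi_num / cpi_den so the whole computation
 * stays in integers; unsigned division truncates, which equals floor for
 * these non-negative values. All names here are hypothetical. */
unsigned preload_scheduling_distance(unsigned n_linexfer, unsigned n_pref,
                                     unsigned n_hwlinexfer, unsigned n_evict,
                                     unsigned cpi_num, unsigned cpi_den,
                                     unsigned n_inst)
{
    unsigned long long numer = (unsigned long long)n_linexfer * n_pref
                             + (unsigned long long)n_hwlinexfer * n_evict;
    unsigned long long denom = (unsigned long long)cpi_num * n_inst;

    if (denom == 0 || cpi_den == 0)
        return 0;   /* degenerate input: fall back to no preload distance */

    /* numer / ((cpi_num / cpi_den) * n_inst) == numer * cpi_den / denom */
    return (unsigned)(numer * cpi_den / denom);
}

/* Sum an array while preloading the element that will be needed psd
 * iterations from now. The future element sits at a fixed offset from the
 * current one, so its address is easy to compute. */
long sum_with_preload(const int *data, size_t n, unsigned psd)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + psd < n)
            __builtin_prefetch(&data[i + psd], 0 /* preload for reading */);
        total += data[i];
    }
    return total;
}
```

With the made-up inputs N_linexfer = 24, N_pref = 2, N_hwlinexfer = 12, N_evict = 1, CPI = 1.5 (passed as 3/2) and N_inst = 10, the first function returns floor(60 / 15) = 4, so the loop would preload four iterations ahead. These numbers are placeholders, not measured PXA27x timings. A production loop would normally issue one preload per cache line rather than one per element, and the distance should still be tuned by measurement, since the formula only gives an initial estimate.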

Intel® PXA27x Processor Family Optimization Guide
