Intel PXA270 Optimization Manual page 100

Pxa27x processor family

High Level Language Optimization
Consider this code sample:
    add   r1, r1, #1
    ; Sequence of instructions using r2, but leave r3 unchanged.
    ldr   r2, [r3]
    add   r3, r3, #4
    mov   r4, r3
    sub   r2, r2, #1

The sub instruction above would stall if the data being loaded misses the cache. These stalls can be
avoided by using a PLD instruction as:

    pld   [r3]
    add   r1, r1, #1
    ; Sequence of instructions using r2, but leave r3 unchanged.
    ldr   r2, [r3]
    add   r3, r3, #4
    mov   r4, r3
    sub   r2, r2, #1

For most cases, optimizing for the external memory latency also satisfies the requirements for the
internal memory latency.

5.1.1.1.2 Preload Loop Scheduling

When adding preload instructions to a loop which operates on arrays, preload ahead one, two, or
more iterations. The data for future iterations is located in memory a fixed offset from the data for
the current iteration. This makes it easy to predict where to fetch the data. The number of iterations
to preload ahead is referred to as the preload scheduling distance (PSD). For the Intel XScale®
Microarchitecture this can be calculated as:

$$\mathrm{PSD} = \operatorname{floor}\left( \frac{N_{linexfer} \times N_{pref} + N_{hwlinexfer} \times N_{evict}}{CPI \times N_{inst}} \right)$$

Where:
$N_{linexfer}$: The number of core clocks required to transfer one complete cache line.
$N_{pref}$: The number of cache lines to be pre-loaded for both reading and writing.
$N_{evict}$: The number of cache half line evictions caused by the loop.
$N_{inst}$: The number of instructions executed in one iteration of the loop.
$N_{hwlinexfer}$: The number of core clocks required to write half a cache line (as if) only one of the
cache line dirty bits were set when a line eviction occurred.
$CPI$: This is the average number of core clocks per instruction (of the instructions within the
loop).

PSD calculated in the above equation is a good initial estimation, but may not be the optimum
scheduling distance. Estimating $N_{evict}$ is difficult from static code. However, if the operational data
uses the mini-data cache and if the loop operations overflow the mini-data cache, then a first order
Intel® PXA27x Processor Family Optimization Guide
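
The PSD formula and the preload-ahead idea from Section 5.1.1.1.2 can be illustrated with a minimal C sketch. It is not taken from the guide. The function names `psd_estimate` and `sum_with_preload` are hypothetical, and the sketch assumes a GCC or Clang toolchain, whose `__builtin_prefetch` builtin is only a hint and is expected to map to a preload instruction such as PLD on ARM targets that support it.

```c
#include <assert.h>
#include <stddef.h>

/* Preload scheduling distance, following the formula in the text:
 *   floor((n_linexfer * n_pref + n_hwlinexfer * n_evict) / (cpi * n_inst))
 * Every input is an estimate supplied by the caller (hypothetical helper). */
static unsigned psd_estimate(unsigned n_linexfer, unsigned n_pref,
                             unsigned n_hwlinexfer, unsigned n_evict,
                             double cpi, unsigned n_inst)
{
    /* The denominator must be positive for the formula to make sense. */
    assert(cpi > 0.0 && n_inst > 0);

    double transfer_clocks = (double)n_linexfer * n_pref
                           + (double)n_hwlinexfer * n_evict;
    double clocks_per_iteration = cpi * n_inst;

    /* Converting a non-negative double to unsigned truncates, i.e. floors. */
    return (unsigned)(transfer_clocks / clocks_per_iteration);
}

/* Sum an array while hinting the element that will be needed `psd`
 * iterations from now. Here one loop iteration handles one element,
 * which is a simplifying assumption of this sketch. */
static long sum_with_preload(const int *a, size_t n, size_t psd)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + psd < n)
            __builtin_prefetch(&a[i + psd]); /* hint only, never faults */
        total += a[i];
    }
    return total;
}
```

As a usage illustration with made-up numbers (not measurements from any PXA27x system): `psd_estimate(40, 1, 20, 0, 1.0, 12)` evaluates 40 / 12 and floors it to 3, so the loop would preload three iterations ahead via `sum_with_preload(data, n, 3)`. Because this loop touches one element per iteration, several hints land on the same cache line. A loop that consumes a whole line per iteration would issue one hint per line instead.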

This manual is also suitable for:

Pxa271, Pxa272, Pxa273