Intel PXA270 Optimization Manual page 57

Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
In the code shown in the following example, the ADD instruction following the LDR stalls for two
cycles because it uses the result of the load.
    add     r1, r2, r3
    ldr     r0, [r5]
    add     r6, r0, r1
    sub     r8, r2, r3
    mul     r9, r2, r3
Rearrange the code as shown to prevent the stall:
    ldr     r0, [r5]
    add     r1, r2, r3
    sub     r8, r2, r3
    add     r6, r0, r1
    mul     r9, r2, r3
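The effect of the reordering can be checked with a toy cycle model. The C sketch below is only an illustration, not a description of the actual pipeline: it assumes a single-issue, in-order machine in which a load result becomes usable three cycles after the load issues (which gives the two-cycle stall described above) and every other result is usable on the next cycle. The names Insn, count_stalls, original_order and rearranged_order are hypothetical and do not come from the guide.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model constants (assumptions, not taken from the guide's tables). */
enum { NO_REG = -1, NUM_REGS = 16, LOAD_LATENCY = 3, ALU_LATENCY = 1 };

typedef struct {
    int is_load;  /* 1 for LDR, 0 for any other instruction */
    int dst;      /* destination register number */
    int src1;     /* first source register, or NO_REG */
    int src2;     /* second source register, or NO_REG */
} Insn;

/* Count stall cycles for an in-order, single-issue instruction sequence. */
int count_stalls(const Insn *code, size_t n)
{
    int ready[NUM_REGS] = {0};  /* cycle at which each register is usable */
    int cycle = 0;              /* next free issue slot */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        const int srcs[2] = { code[i].src1, code[i].src2 };
        int earliest = cycle;

        /* An instruction cannot issue before all its sources are ready. */
        for (int s = 0; s < 2; s++) {
            if (srcs[s] != NO_REG) {
                assert(srcs[s] >= 0 && srcs[s] < NUM_REGS);
                if (ready[srcs[s]] > earliest)
                    earliest = ready[srcs[s]];
            }
        }

        assert(code[i].dst >= 0 && code[i].dst < NUM_REGS);
        stalls += earliest - cycle;
        ready[code[i].dst] =
            earliest + (code[i].is_load ? LOAD_LATENCY : ALU_LATENCY);
        cycle = earliest + 1;
    }
    return stalls;
}

/* add r1,r2,r3 / ldr r0,[r5] / add r6,r0,r1 / sub r8,r2,r3 / mul r9,r2,r3 */
static const Insn original_order[] = {
    {0, 1, 2, 3}, {1, 0, 5, NO_REG}, {0, 6, 0, 1}, {0, 8, 2, 3}, {0, 9, 2, 3},
};

/* ldr r0,[r5] / add r1,r2,r3 / sub r8,r2,r3 / add r6,r0,r1 / mul r9,r2,r3 */
static const Insn rearranged_order[] = {
    {1, 0, 5, NO_REG}, {0, 1, 2, 3}, {0, 8, 2, 3}, {0, 6, 0, 1}, {0, 9, 2, 3},
};
```

Under these assumptions, count_stalls(original_order, 5) reports the two stall cycles on the dependent ADD, and count_stalls(rearranged_order, 5) reports none, because two independent instructions now sit between the load and its first use.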
This rearrangement is not always possible. In the following example, the LDR instruction cannot
be moved before the ADDNE or the SUBEQ instructions because the LDR instruction depends on
the result of these instructions.
    cmp     r1, #0
    addne   r4, r5, #4
    subeq   r4, r5, #4
    ldr     r0, [r4]
    cmp     r0, #10
The following version rewrites the code so that it runs faster, at the expense of increased code size:
    cmp     r1, #0
    ldrne   r0, [r5, #4]
    ldreq   r0, [r5, #-4]
    addne   r4, r5, #4
    subeq   r4, r5, #4
    cmp     r0, #10
The optimized code takes six cycles to execute compared to the seven cycles taken by the
unoptimized version.
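The rewrite is safe because both versions leave the same values in r0 and r4; only the order of the load and the address update changes. As a rough C-level illustration (the type Result and the functions load_word, address_then_load and load_then_address are hypothetical names introduced here, not part of the guide), the two forms can be written side by side:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *r4;  /* updated address register */
    uint32_t r0;        /* loaded word */
} Result;

/* Read one 32-bit word from a byte address without alignment assumptions. */
static uint32_t load_word(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Original form: compute the address first, then load through it. */
Result address_then_load(const uint8_t *r5, int r1)
{
    Result out;
    assert(r5 != NULL);
    out.r4 = (r1 != 0) ? r5 + 4 : r5 - 4;  /* addne / subeq */
    out.r0 = load_word(out.r4);            /* ldr r0, [r4]  */
    return out;
}

/* Rewritten form: load directly with an offset, then update the address. */
Result load_then_address(const uint8_t *r5, int r1)
{
    Result out;
    assert(r5 != NULL);
    out.r0 = (r1 != 0) ? load_word(r5 + 4)   /* ldrne r0, [r5, #4]  */
                       : load_word(r5 - 4);  /* ldreq r0, [r5, #-4] */
    out.r4 = (r1 != 0) ? r5 + 4 : r5 - 4;    /* addne / subeq       */
    return out;
}
```

In the second form the load no longer waits for the address arithmetic, which is the dependency the assembly rewrite removes. The caller must make sure that both r5 + 4 and r5 - 4 point into valid memory.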
The result latency for an LDR instruction is significantly higher if the data being loaded is not in
the data cache. To help minimize the number of pipeline stalls in such a situation, move the LDR
instruction as far away as possible from the instruction that uses the result of the load. Moving the
LDR instruction can cause certain register values to be spilled to memory due to the increase in
register pressure. In such cases, use a preload instruction to ensure that the data access in the LDR
instruction hits the cache when it executes.
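From C, one way to request a preload is the GCC/Clang builtin __builtin_prefetch, which compilers typically lower to a preload instruction on ARM targets that provide one. The sketch below is an assumption-laden illustration rather than code from this guide: the function name sum_with_preload and the lookahead distance PREFETCH_AHEAD are hypothetical, and a suitable distance has to be tuned for the actual memory latency and loop body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookahead, in elements; tune for the real memory latency. */
enum { PREFETCH_AHEAD = 8 };

/* Sum an array while hinting that data a few iterations ahead is needed. */
int64_t sum_with_preload(const int32_t *data, size_t n)
{
    int64_t total = 0;

    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Read hint (rw = 0) with high temporal locality (3).
           The bounds check keeps the address expression inside the array. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
#endif
        total += data[i];
    }
    return total;
}
```

The preload is only a hint, so the function returns the same sum with or without it. The intended benefit is that the later load is more likely to hit the data cache, without holding an extra register live for the whole distance.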
Intel® PXA27x Processor Family Optimization Guide