Scheduling SWP and SWPB Instructions; Scheduling the MRA and MAR Instructions (MRRC/MCRR) - Intel PXA255 User Manual

XScale microarchitecture

The add operation above would stall for 3 cycles if the multiply takes 4 cycles to complete. It is
better to replace the code segment above with the following sequence:

    mul     r0, r1, r2
    add     r3, r3, #1
    sub     r4, r4, #1
    sub     r5, r5, #1
    cmp     r0, #0

Refer to Section 11.2, "Instruction Latencies", for the latencies of the multiply
instructions. Multiply instructions should be scheduled with these latencies in mind.

A.5.4 Scheduling SWP and SWPB Instructions

The SWP and SWPB instructions have a 5-cycle issue latency. As a result of this latency, the
instruction following a SWP/SWPB instruction stalls for 4 cycles. SWP and SWPB
instructions should, therefore, be used only where absolutely needed.
For example, the following code may be used to swap the contents of 2 memory locations:

    ; Swap the contents of memory locations pointed to by r0 and r1
    ldr     r2, [r0]
    swp     r2, r2, [r1]
    str     r2, [r0]

The code above takes 9 cycles to complete. The rewritten code below takes 6 cycles to execute,
assuming that r3 is available.

    ; Swap the contents of memory locations pointed to by r0 and r1
    ldr     r2, [r0]
    ldr     r3, [r1]
    str     r2, [r1]
    str     r3, [r0]

A.5.5 Scheduling the MRA and MAR Instructions (MRRC/MCRR)

The MRA (MRRC) instruction has an issue latency of 1 cycle, a result latency of 2 or 3 cycles
depending on the destination register value being accessed, and a resource latency of 2 cycles.
Consider the code sample:

    mra     r6, r7, acc0
    mra     r8, r9, acc0
    add     r1, r1, #1

The code shown above would incur a 1-cycle stall due to the 2-cycle resource latency of an MRA
instruction. The code can be rearranged as shown below to prevent this stall.

    mra     r6, r7, acc0
    add     r1, r1, #1
    mra     r8, r9, acc0

Similarly, the code shown below would incur a 2-cycle penalty due to the 3-cycle result latency for
the second destination register.

    mra     r6, r7, acc0
    mov     r1, r7
    mov     r0, r6
    add     r2, r2, #1
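The mra/mov/mov/add sequence above can probably be rescheduled in the same spirit as the earlier MRA example. The following is only a sketch, not taken from the manual text shown here. It assumes that the first destination register (r6) has the 2-cycle result latency and the second (r7) the 3-cycle one, that every instruction issues in a single cycle, and that the add is independent of the two moves.

```asm
    ; Hypothetical rescheduling of the mra/mov/mov/add example.
    ; Assumed latencies: r6 usable 2 cycles after the mra issues,
    ; r7 usable 3 cycles after the mra issues.
    mra     r6, r7, acc0    ; cycle 0: read accumulator acc0 into r6 and r7
    add     r2, r2, #1      ; cycle 1: independent work fills the first latency slot
    mov     r0, r6          ; cycle 2: r6 is ready under the assumed 2-cycle latency
    mov     r1, r7          ; cycle 3: r7 is ready under the assumed 3-cycle latency
```

Under these assumptions no instruction waits on an MRA result, so the 2-cycle penalty should disappear. The order of the two moves matters here: reading r7 first would still stall for one cycle.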
