Scheduling Load And Store Double (Ldrd/Strd) - Intel XScale Core Developer's Manual

Intel XScale® Core Developer's Manual
Optimization Guide
A.5.1.1. Scheduling Load and Store Double (LDRD/STRD)

The Intel XScale® core introduces two new double word instructions: LDRD and STRD. LDRD
loads 64 bits of data from an effective address into two consecutive registers; conversely, STRD
stores 64 bits from two consecutive registers to an effective address. There are two important
restrictions on how these instructions may be used:
- the effective address must be aligned on an 8-byte boundary
- the specified register must be even (r0, r2, etc.).
When both conditions are met, using LDRD/STRD instead of LDM/STM to do the same thing is more
efficient because LDRD/STRD issues in only one/two clock cycle(s), as opposed to LDM/STM,
which issues in four clock cycles. Avoid LDRDs targeting R12; this incurs an extra cycle of issue
latency.
The LDRD instruction has a result latency of 3 or 4 cycles depending on the destination register
being accessed (assuming the data being loaded is in the data cache).
add r6, r7, r8
sub r5, r6, r9
; The following ldrd instruction would load values
; into registers r0 and r1
ldrd r0, [r3]
orr r8, r1, #0xf
mul r7, r0, r7
In the code example above, the ORR instruction would stall for 3 cycles because of the 4 cycle
result latency for the second destination register of an LDRD instruction. The code shown above
can be rearranged to remove the pipeline stalls:
; The following ldrd instruction would load values
; into registers r0 and r1
ldrd r0, [r3]
add r6, r7, r8
sub r5, r6, r9
mul r7, r0, r7
orr r8, r1, #0xf
Any memory operation following an LDRD instruction (LDR, LDRD, STR and so on) would stall
for 1 cycle.
; The str instruction below would stall for 1 cycle
ldrd r0, [r3]
str r4, [r5]
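The stall counts quoted in this section can be reproduced with a small scoreboard model. The Python sketch below is an illustration only, not a description of the actual pipeline. It assumes a single-issue, in-order core in which every instruction takes one issue cycle, data-processing results are usable by the next instruction (a placeholder latency of 1 cycle), LDRD data hits the data cache, and the 1-cycle penalty for a memory operation that follows an LDRD overlaps with any register stall. The names Insn, alu, ldrd, store and stalls_per_insn are hypothetical helpers introduced for this example.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Insn(NamedTuple):
    text: str                            # assembly text, for display only
    dests: Tuple[Tuple[int, int], ...]   # (register, result latency) pairs
    srcs: Tuple[int, ...]                # source register numbers
    is_mem: bool = False                 # True for loads and stores
    is_ldrd: bool = False                # True for LDRD


def alu(text: str, rd: int, *srcs: int) -> Insn:
    # Assumption: data-processing results are available after 1 cycle.
    return Insn(text, ((rd, 1),), tuple(srcs))


def ldrd(text: str, rd: int, base: int) -> Insn:
    # The section requires an even first destination register.
    if rd % 2 != 0:
        raise ValueError("LDRD destination register must be even")
    # Result latency: 3 cycles for the first register, 4 for the second.
    return Insn(text, ((rd, 3), (rd + 1, 4)), (base,), True, True)


def store(text: str, rt: int, base: int) -> Insn:
    return Insn(text, (), (rt, base), True)


def stalls_per_insn(code: Sequence[Insn]) -> List[int]:
    """Return the number of stall cycles charged to each instruction."""
    ready = [0] * 16      # cycle at which each register becomes usable
    cycle = 0             # earliest cycle the next instruction could issue
    prev_ldrd = False
    stalls = []
    for insn in code:
        # Wait until every source register is available.
        issue = max([cycle] + [ready[r] for r in insn.srcs])
        # A memory operation right after an LDRD waits one extra cycle.
        if prev_ldrd and insn.is_mem:
            issue = max(issue, cycle + 1)
        stalls.append(issue - cycle)
        for reg, latency in insn.dests:
            ready[reg] = issue + latency
        prev_ldrd = insn.is_ldrd
        cycle = issue + 1
    return stalls


ORIGINAL = [
    alu("add r6, r7, r8", 6, 7, 8),
    alu("sub r5, r6, r9", 5, 6, 9),
    ldrd("ldrd r0, [r3]", 0, 3),
    alu("orr r8, r1, #0xf", 8, 1),
    alu("mul r7, r0, r7", 7, 0, 7),
]

RESCHEDULED = [
    ldrd("ldrd r0, [r3]", 0, 3),
    alu("add r6, r7, r8", 6, 7, 8),
    alu("sub r5, r6, r9", 5, 6, 9),
    alu("mul r7, r0, r7", 7, 0, 7),
    alu("orr r8, r1, #0xf", 8, 1),
]

LDRD_THEN_STR = [
    ldrd("ldrd r0, [r3]", 0, 3),
    store("str r4, [r5]", 4, 5),
]

if __name__ == "__main__":
    for name, seq in [("original", ORIGINAL),
                      ("rescheduled", RESCHEDULED),
                      ("ldrd then str", LDRD_THEN_STR)]:
        print(name, stalls_per_insn(seq))
```

Under these assumptions the model charges 3 stall cycles to the ORR in the original ordering, none to any instruction in the rearranged ordering, and 1 stall cycle to the STR that follows the LDRD, which matches the figures given in the text. It does not model the extra issue cycle for an LDRD targeting R12 or cache misses.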
January, 2004
Developer's Manual
