Scheduling Load and Store Double (LDRD/STRD) - Intel PXA255 User Manual

XScale microarchitecture

Optimization Guide
As can be seen above, the contents of the register r6 have been spilled to the stack and subsequently
loaded back to the register r6 to retain the program semantics. Another way to optimize the code
above is with the use of the preload instruction as shown below:
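; all other registers are in use
add r0, r4, r5
pld [r0]
sub r1, r6, r7
mul r3, r6, r2
mov r2, r2, LSL #2
orr r9, r9, #0xf
ldr r6, [r0]
add r8, r6, r8
add r8, r8, #4
orr r8, r8, #0xf
; The value in register r6 is not used after this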
The Intel® XScale™ core has 4 fill-buffers that are used to fetch data from external memory when
a data-cache miss occurs. The Intel® XScale™ core stalls when all fill buffers are in use. This
happens when more than 4 loads are outstanding and are being fetched from memory. As a result,
the code written should ensure that no more than 4 loads are outstanding at the same time. For
example, the number of loads issued sequentially should not exceed 4. Also note that a preload
instruction may cause a fill buffer to be used. As a result, the number of preload instructions
outstanding should also be considered to derive how many loads are simultaneously outstanding.
Similarly, the number of write buffers limits the number of successive writes that can be issued
before the processor stalls: no more than eight stores can be issued back to back. Also note that
if the data cache uses the write-allocate with writeback policy, a load may itself cause stores to
external memory when the read evicts a cache line that is dirty (modified). These additional writes
may further limit the number of sequential stores.
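The fill-buffer limit described above can be checked mechanically on a straight-line instruction
sequence. The following is a minimal sketch in C and is not part of the manual: the op_kind enum
and the helpers longest_load_run and load_run_within_limit are hypothetical names, and the model
is deliberately simplified. It assumes every load or preload misses the data cache and therefore
occupies a fill buffer, and it counts only back-to-back runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction classification used only by this sketch. */
enum op_kind { OP_OTHER, OP_LOAD, OP_PRELOAD, OP_STORE };

/* Limit taken from the text: the core has 4 fill buffers. */
#define FILL_BUFFERS 4

/* Returns the length of the longest run of consecutive loads/preloads.
 * Simplifying assumption: each load or preload misses the data cache
 * and so holds one fill buffer. */
size_t longest_load_run(const enum op_kind *ops, size_t n)
{
    size_t run = 0, best = 0;
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == OP_LOAD || ops[i] == OP_PRELOAD) {
            run++;
            if (run > best)
                best = run;
        } else {
            run = 0; /* any other instruction ends the run */
        }
    }
    return best;
}

/* Nonzero when no run of loads/preloads exceeds the fill-buffer count. */
int load_run_within_limit(const enum op_kind *ops, size_t n)
{
    return longest_load_run(ops, n) <= FILL_BUFFERS;
}
```

A sequence containing five loads in a row would be flagged by load_run_within_limit, which suggests
moving an independent arithmetic instruction between them. A real schedule also depends on cache
hits and memory latency, which this sketch does not model.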
A.5.1.1. Scheduling Load and Store Double (LDRD/STRD)

The Intel® XScale™ core introduces two new double-word instructions: LDRD and STRD.
LDRD loads 64 bits of data from an effective address into two consecutive registers. Conversely,
STRD stores 64 bits from two consecutive registers to an effective address. There are two
important restrictions on how these instructions may be used:
- the effective address must be aligned on an 8-byte boundary
- the specified register must be even (r0, r2, etc.).
When both restrictions are met, using LDRD/STRD instead of LDM/STM to do the same thing is more
efficient because LDRD and STRD issue in only one and two clock cycles, respectively, as opposed
to LDM/STM, which issues in four clock cycles. Avoid LDRDs targeting R12; this incurs an extra
cycle of issue latency.
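The two restrictions and the LDRD issue-latency figures above can be summarized in a few lines of
C. This is an illustrative sketch only, not code from the manual: ldrd_strd_operands_ok and
ldrd_issue_cycles are hypothetical helper names, and register numbers are passed as plain integers
(0 for r0, 12 for r12).

```c
#include <assert.h>
#include <stdint.h>

/* Restrictions from the text: the effective address must be 8-byte
 * aligned and the specified (first) register must be even. */
int ldrd_strd_operands_ok(uintptr_t addr, unsigned rd)
{
    return (addr % 8u == 0) && (rd % 2u == 0);
}

/* Issue cycles for LDRD as described in the text: one cycle, plus one
 * extra cycle when the destination register is r12. */
unsigned ldrd_issue_cycles(unsigned rd)
{
    assert(rd % 2u == 0); /* an odd register is not a valid LDRD operand */
    return rd == 12u ? 2u : 1u;
}
```

For example, an address of 0x1004 fails the alignment check, so a pair of registers loaded from
there would need LDM or two LDR instructions instead.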
The LDRD instruction has a result latency of 3 or 4 cycles depending on the destination register
being accessed (assuming the data being loaded is in the data cache).
add r6, r7, r8
sub r5, r6, r9
; The following ldrd instruction would load values
; into registers r0 and r1
ldrd r0, [r3]
orr r8, r1, #0xf
mul r7, r0, r7
Intel® XScale™ Microarchitecture User's Manual
