Scheduling Load and Store Double (LDRD/STRD) - Intel PXA255 User Manual

XScale microarchitecture


Optimization Guide

As can be seen above, the contents of register r6 have been spilled to the stack and subsequently loaded back into r6 to preserve the program semantics. Another way to optimize the code above is to use the preload (PLD) instruction, as shown below:

    ; all other registers are in use
    add   r0, r4, r5
    pld   [r0]            ; request the cache line containing [r0] early
    sub   r1, r6, r7
    mul   r3, r6, r2
    mov   r2, r2, LSL #2
    orr   r9, r9, #0xf
    ldr   r6, [r0]        ; r6 is overwritten only after sub/mul read its old value
    add   r8, r6, r8
    add   r8, r8, #4
    orr   r8, r8, #0xf
    ; The value in register r6 is not used after this

The Intel® XScale™ core has 4 fill buffers that are used to fetch data from external memory when a data-cache miss occurs. The core stalls when all fill buffers are in use, which happens when more than 4 loads are outstanding and being fetched from memory. Code should therefore ensure that no more than 4 loads are outstanding at the same time; for example, no more than 4 loads should be issued back to back. Note also that a preload instruction may occupy a fill buffer, so outstanding preloads must be counted together with outstanding loads when determining how many loads are simultaneously outstanding.

Similarly, the number of write buffers limits how many successive stores can be issued before the processor stalls: no more than eight stores can be issued in succession. Note also that if the data cache uses a write-back, write-allocate policy, a load can cause stores to external memory when the line it brings in evicts a dirty (modified) cache line. These additional writes may further limit the number of sequential stores.

A.5.1.1. Scheduling Load and Store Double (LDRD/STRD)

The Intel® XScale™ core introduces two new double-word instructions: LDRD and STRD. LDRD loads 64 bits of data from an effective address into two consecutive registers; conversely, STRD stores 64 bits from two consecutive registers to an effective address. There are two important restrictions on how these instructions may be used:

• the effective address must be aligned on an 8-byte boundary
• the specified (first) register must be even (r0, r2, etc.).

When both restrictions are met, using LDRD/STRD instead of LDM/STM to transfer the same pair of registers is more efficient: LDRD issues in only one clock cycle and STRD in two, whereas LDM/STM issues in four clock cycles. Avoid LDRDs that target r12; they incur an extra cycle of issue latency.

The LDRD instruction has a result latency of 3 or 4 cycles, depending on which destination register is accessed (assuming the data being loaded is in the data cache).

    add   r6, r7, r8
    sub   r5, r6, r9
    ; The following ldrd instruction would load values
    ; into registers r0 and r1
    ldrd  r0, [r3]
    orr   r8, r1, #0xf
    mul   r7, r0, r7

Intel® XScale™ Microarchitecture User's Manual
