Memory Page Thrashing; Prefetch Considerations; Prefetch Distances; Prefetch Loop Scheduling - Intel PXA255 User Manual

XScale Microarchitecture
Optimization Guide
A.4.3.2. Memory Page Thrashing

Memory page thrashing occurs because of the nature of SDRAM. SDRAMs are typically divided
into multiple banks, and each bank can have one selected (open) page; for current memory
components the page size is commonly 4 KB. Memory latency for an access to the currently
selected page is 2 to 3 bus clocks. Thrashing occurs when successive memory accesses
within the same memory bank touch different pages: each page change adds 3 to 4 bus
clock cycles to the memory latency. This added delay lengthens the required prefetch distance
correspondingly, making it more difficult to hide memory access latency. This type of thrashing
can be resolved by placing the conflicting data structures in different memory banks, or by
interleaving the data structures so that related data resides within the same memory page. It is
also extremely important to ensure that instruction and data sections are in different memory
banks; otherwise they will continually thrash the memory page selection.
A.4.4 Prefetch Considerations

The Intel® XScale™ core has a true prefetch load instruction (PLD). The purpose of this
instruction is to preload data into the data and mini-data caches. Data prefetching hides
memory transfer latency while the processor continues to execute instructions. Prefetch is
important to both compiled and assembly code because judicious use of the prefetch instruction
can greatly improve throughput on the Intel® XScale™ core. Data prefetch can be
applied not only to loops but also to any data references within a block of code. Prefetch also
applies to data writes when the memory region is configured as write-allocate.
The Intel® XScale™ core prefetch load instruction is a true prefetch instruction because the load
destination is the data or mini-data cache rather than a register. Compilers for processors that have
data caches but do not support prefetch sometimes use a load instruction to preload the data cache.
That technique has the disadvantage of consuming a register for each load and requiring additional
registers for subsequent preloads, thus increasing register pressure. By contrast, the prefetch
instruction can reduce register pressure instead of increasing it.
The prefetch load is a hint instruction and does not guarantee that the data will be loaded.
Whenever the load would cause a fault or a table walk, the processor ignores the prefetch
instruction, suppresses the fault or table walk, and continues with the next instruction. This is
particularly advantageous where a linked list or recursive data structure is terminated by a NULL
pointer: prefetching through the NULL pointer does not fault or interrupt program flow.
A.4.4.1. Prefetch Distances

Scheduling the prefetch instruction requires some understanding of the system latencies and
system resources that determine when to use it. For the PXA255 processor, a cache line fill of
8 words from external memory takes more than 10 memory clocks, depending on external RAM
speed and system timing configuration. With the core running faster than memory, data from
external memory may take many tens of core clocks to arrive, especially when the referenced
word is the last in the cache line. Considerable savings are therefore possible when prefetch
loads are issued many instructions before the data is referenced.
A.4.4.2. Prefetch Loop Scheduling

When adding prefetch to a loop which operates on arrays, it may be advantageous to prefetch ahead
one, two, or more iterations. The data for future iterations is located in memory by a fixed offset
from the data for the current iteration. This makes it easy to predict where to fetch the data. The
number of iterations to prefetch ahead is referred to as the prefetch scheduling distance.
Intel® XScale™ Microarchitecture User's Manual
