Software Pipelining - Nintendo Ultra64 Programmer's Manual

there are, and this number is not variable. (2) we have severe code space
constraints. Abstracting the vector unit size has severe implications for the
vector code start-up.
The point of this discussion is to observe that the hardware architecture is
clearly visible in the microcode. We program for a specific vector size, and
we waste no code generalizing data parallelism.
The good news is that this limitation also has a major benefit: We are
exposed to the hardware at a low enough level that we can, by inspection,
determine if the vector unit is fully utilized. This is rarely possible, if at all,
on a machine with an architecture or compiler designed for configurable
vector elements (like a Cray).
"Keeping the vector elements full"
Hint:
keys to maximum performance.

Software Pipelining

SIMD processing achieves maximum performance when there is a high degree of
data parallelism. This simply means that there are lots of independent data
items that can all be operated on at once.
An important idea in vector processing is that data recurrence is not
allowed. Consider this code fragment:
for (i=0; i<n; i++) {
    a[i] = a[i-1] * 2.0;
}
In this example, we could not vectorize the loop because element a[i]
depends on element a[i-1]. The elements are not independent. This places a
restriction on the kinds of loops we can vectorize and on the organization
of our data (which "axis" we choose to vectorize). It also suggests games
we might want to play with our loops (see "Loop Inversion" on page 131).
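
For contrast, here is a sketch of our own (not from the manual) of a loop with no recurrence: no element depends on any other, so every slice of the vector unit can work at once. The scale() name and arguments are hypothetical.

/*
 * Hypothetical contrast example (not from the manual): every iteration
 * is independent, so the loop vectorizes directly.
 */
void scale(float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * 2.0f;    /* no a[i-1] term, hence no recurrence */
}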
A similar problem, another kind of pipelining problem, is data dependency.
Because the vector unit has a non-zero pipeline delay, we cannot attempt to
use the results of an instruction until several clock cycles after that
instruction is "executed".
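
The manual's own example would be RSP microcode; as a stand-in, the following C sketch (our illustration, with hypothetical names) shows the software-pipelining idea: issue the next iteration's multiply before consuming the current result, so the pipeline delay is hidden behind useful work.

/*
 * Hypothetical sketch of software pipelining (not from the manual).
 * The multiply models a long-latency vector operation; its result is
 * not consumed until the next iteration's multiply has been issued.
 */
void scale_bias(float *b, const float *a, int n)
{
    int i;
    float t;

    if (n <= 0)
        return;

    t = a[0] * 2.0f;                   /* prologue: first multiply      */
    for (i = 0; i < n - 1; i++) {
        float next = a[i + 1] * 2.0f;  /* issue next multiply early     */
        b[i] = t + 1.0f;               /* consume the earlier result    */
        t = next;
    }
    b[n - 1] = t + 1.0f;               /* epilogue: finish last element */
}

The prologue and epilogue handle the first and last iterations that fall outside the steady-state loop, which is the part that runs with the latency fully hidden.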
