Software Pipelining - Nintendo Ultra64 Programmer's Manual

there are, and this number is not variable. (2) we have severe code space
constraints. Abstracting the vector unit size has severe implications for the
vector code start-up.
The point of this discussion is to observe that the hardware architecture is
clearly visible in the microcode. We program for a specific vector size, and
we waste no code generalizing data parallelism.
The good news is that this limitation also has a major benefit: We are
exposed to the hardware at a low enough level that we can, by inspection,
determine if the vector unit is fully utilized. This is rarely possible, if at all,
on a machine with an architecture or compiler designed for configurable
vector elements (like a Cray).
"Keeping the vector elements full"
Hint:
keys to maximum performance.

Software Pipelining

SIMD processing achieves maximum performance when there is a high degree of
data parallelism. This simply means that there are lots of independent data
items that can all be operated on at once.
An important idea in vector processing is that data recurrence is not
allowed. Consider this code fragment:
for (i=0; i<n; i++) {
    a[i] = a[i-1] * 2.0;
}
In this example, we could not vectorize the loop because element a[i]
depends on element a[i-1]. The elements are not independent. This places a
restriction on the kinds of loops we can vectorize and on the organization
of our data (which "axis" we choose to vectorize). It also suggests games
we might want to play with our loops (see "Loop Inversion" on page 131).
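
For contrast, here is a sketch of our own (not from the manual) of a loop with no recurrence: no element depends on any other, so every slice of the vector unit can work at once. The scale() name and arguments are hypothetical.

/*
 * Hypothetical contrast example (not from the manual): every iteration
 * is independent, so the loop vectorizes directly.
 */
void scale(float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * 2.0f;    /* no a[i-1] term, hence no recurrence */
}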
A similar problem, another kind of pipelining problem, is data dependency.
Because the vector unit has a non-zero pipeline delay, we cannot attempt to
use the results of an instruction until several clock cycles after that
instruction is "executed".
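
The manual's own example would be RSP microcode; as a stand-in, the following C sketch (our illustration, with hypothetical names) shows the software-pipelining idea: issue the next iteration's multiply before consuming the current result, so the pipeline delay is hidden behind useful work.

/*
 * Hypothetical sketch of software pipelining (not from the manual).
 * The multiply models a long-latency vector operation; its result is
 * not consumed until the next iteration's multiply has been issued.
 */
void scale_bias(float *b, const float *a, int n)
{
    int i;
    float t;

    if (n <= 0)
        return;

    t = a[0] * 2.0f;                   /* prologue: first multiply      */
    for (i = 0; i < n - 1; i++) {
        float next = a[i + 1] * 2.0f;  /* issue next multiply early     */
        b[i] = t + 1.0f;               /* consume the earlier result    */
        t = next;
    }
    b[n - 1] = t + 1.0f;               /* epilogue: finish last element */
}

The prologue and epilogue handle the first and last iterations that fall outside the steady-state loop, which is the part that runs with the latency fully hidden.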
