Loop Inversion - Nintendo Ultra64 Programmer's Manual

Revision 1.0
    vadd $v1, $v2, $v3
    vadd $v4, $v4, $v1
In this example, the second vadd instruction cannot execute until the first
vadd has completed and written back its result. There is a data dependency
on register $v1. The result is a pipeline stall that effectively serializes
the vector code, seriously degrading its performance.
Note: Fortunately, the hardware does do register usage locking in this
case; the above code may be slow, but at least it is guaranteed to generate
the correct results.
If a data dependency cannot be avoided, try rearranging code so that at least
some useful work is done during the delay.
Hint: "Keeping the pipeline full" is going to be one of your keys to
maximum performance.
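
The rearrangement suggested above can be sketched in C. This is only a
model, not RSP code: C cannot show the stall itself, only that moving an
independent instruction between the two dependent ones leaves every result
unchanged. The vec_t type, the lane_add helper, and the third instruction
(vadd $v5, $v6, $v7) are hypothetical and do not come from the manual. The
eight-element register width is an assumption, and lane_add ignores the
clamping and carry behavior of the real vadd.

```c
#include <stdint.h>
#include <string.h>

#define LANES 8  /* assumption: eight 16-bit elements per vector register */
#define NREGS 8  /* only $v0..$v7 are modeled in this sketch */

/* Hypothetical stand-in for one vector register. */
typedef struct { int16_t lane[LANES]; } vec_t;

/* Simplified element-wise add; NOT a faithful model of vadd
 * (no clamping, no carry). Keep inputs small to avoid overflow. */
vec_t lane_add(vec_t a, vec_t b)
{
    vec_t r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (int16_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* Straight order: the second add needs $v1 right after it is produced. */
void dependent_order(vec_t v[NREGS])
{
    v[1] = lane_add(v[2], v[3]);  /* vadd $v1, $v2, $v3 */
    v[4] = lane_add(v[4], v[1]);  /* vadd $v4, $v4, $v1 (waits on $v1) */
    v[5] = lane_add(v[6], v[7]);  /* vadd $v5, $v6, $v7 (hypothetical, independent) */
}

/* Rearranged order: the independent add is issued during the delay. */
void rearranged_order(vec_t v[NREGS])
{
    v[1] = lane_add(v[2], v[3]);  /* vadd $v1, $v2, $v3 */
    v[5] = lane_add(v[6], v[7]);  /* independent work done while $v1 completes */
    v[4] = lane_add(v[4], v[1]);  /* vadd $v4, $v4, $v1 */
}
```

Both functions leave identical values in every modeled register; the
reordering is legal only because the moved instruction neither reads nor
writes $v1 or $v4.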

Loop Inversion

A common trick used in vector programming is loop inversion. This means
swapping inner and outer loops, in order to create the simplest loop with the
largest number of iterations so we can maximize vectorization.
Consider the following code fragment which could be used for vertex
translation:
    for (i = 0; i < num_pts; i++) {      /* for each point */
        for (j = 0; j < 4; j++) {        /* for each dimension */
            point[i][j] += offset[j];
        }
    }
Since we can only vectorize the inner-most operation (the addition), and
that loop covers only four elements, we would only be using 50% of our
vector unit (four of the eight elements in a vector register).
Now suppose we have an infinite number of vector elements. If we did, we
could swap the loops and do the outer loop four times, vectorizing the inner
loop across num_pts elements:
    for (i = 0; i < 4; i++) {            /* for each dimension */
        for (j = 0; j < num_pts; j++) {  /* for each point */
            point[j][i] += offset[i];
        }
    }
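
As a quick check that the inversion is safe, the two fragments above can be
wrapped in C functions and compared. This is a minimal sketch: the function
names, the int element type, and the DIMS constant are assumptions made for
illustration, since the manual does not give declarations for point or
offset.

```c
#include <stddef.h>

#define DIMS 4  /* four dimensions per point, as in the fragments above */

/* Original order: outer loop over points, inner loop over dimensions. */
void translate_by_point(int point[][DIMS], size_t num_pts,
                        const int offset[DIMS])
{
    for (size_t i = 0; i < num_pts; i++)       /* for each point */
        for (size_t j = 0; j < DIMS; j++)      /* for each dimension */
            point[i][j] += offset[j];
}

/* Inverted order: outer loop over dimensions, inner loop over points. */
void translate_by_dimension(int point[][DIMS], size_t num_pts,
                            const int offset[DIMS])
{
    for (size_t i = 0; i < DIMS; i++)          /* for each dimension */
        for (size_t j = 0; j < num_pts; j++)   /* for each point */
            point[j][i] += offset[i];
}
```

Each element point[p][d] receives offset[d] exactly once in either order,
and no iteration reads a value written by another, which is why the loops
can be swapped. The inverted inner loop runs num_pts times instead of four,
which is the long, simple loop that vectorization wants.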
