under-utilized during the prolog and epilog phases. Part of the prolog and epilog could
be peeled off and merged with the code preceding and following the loop.
The following is a pipelined version of that counted loop with an explicit prolog and epilog:

        mov   lc = 196
        mov   ec = 1
prolog:
        ld4   r35 = [r5],4 ;;     // Cycle 0
        ld4   r34 = [r5],4 ;;     // Cycle 1
        ld4   r33 = [r5],4        // Cycle 2
        add   r36 = r35,r9 ;;     // Cycle 2
L1:
        ld4   r32 = [r5],4
        add   r35 = r34,r9
        st4   [r6] = r36,4
L2:     br.ctop L1 ;;
epilog:
        add   r35 = r34,r9        // Cycle 0
        st4   [r6] = r36,4 ;;     // Cycle 0
        add   r34 = r33,r9        // Cycle 1
        st4   [r6] = r35,4 ;;     // Cycle 1
        st4   [r6] = r34,4        // Cycle 2

The entire prolog (first three iterations of the kernel loop) and epilog (last three
iterations) have been peeled off. No attempt has been made to reschedule the peeled
instructions. The stage predicates have been removed from the instructions since they
are not required for controlling the prolog and epilog phases. Removing them from the
prolog makes the prolog instructions independent of the rotating predicates and
eliminates the need for software-pipelined loop branches between prolog stages. Thus,
the entire prolog is independent of the initialization of LC and EC that precedes it. The
register numbers in the prolog and epilog have been adjusted to account for the lack of
rotation between stages during those phases.
Note: This code assumes that the trip count of the source loop is at least four. If the
minimum trip count is unknown at compile time, then a runtime check of the
trip count must be added before the prolog. If the trip count is less than four,
then control branches to a copy of the original loop.
If this pipelined loop is nested inside an outer loop, there exists a further optimization
opportunity. The outer loop could be rotated such that the kernel loop is at the top,
followed by the epilog for the current outer loop iteration and the prolog for the next
outer loop iteration. A copy of the prolog would also be added prior to the outer loop.
Note: From the earlier trace of the counted loop execution, the functional unit usage
of the prolog and of the epilog is complementary, so the two could be overlapped
very effectively.
The drawback of creating an explicit prolog or epilog is code expansion.
Volume 1, Part 2: Software Pipelining and Loop Support
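As a rough cross-check of the register renumbering described in this section, the peeled
schedule can be modeled in a few lines of Python. This is only a sketch, not Itanium code.
The dictionary r stands in for r32 through r36, rotate approximates the register rotation
performed by br.ctop, and the br.ctop handling is simplified to the EC = 1 case used here.
The function names are hypothetical, and setting lc = n - 4 for a trip count n is an
assumption extrapolated from the lc = 196 example (197 kernel iterations plus 3 peeled
iterations, or 200 source iterations).

```python
def original_loop(x, c):
    """Reference semantics of the source loop: store x[i] + c for every i."""
    return [v + c for v in x]


def rotate(r):
    """Approximate register rotation: the value in r[k] becomes visible as r[k+1]."""
    for k in range(36, 32, -1):
        r[k] = r[k - 1]
    r[32] = None  # overwritten by the kernel load before it is ever read


def peeled_pipeline(x, c):
    """Model of the explicit prolog, kernel loop, and epilog."""
    n = len(x)
    if n < 4:
        # Runtime trip-count check from the Note: use a copy of the original loop.
        return original_loop(x, c)

    r = dict.fromkeys(range(32, 37))  # r32..r36, initially None
    loads = iter(x)                   # stands in for ld4 rX = [r5],4
    out = []                          # stands in for st4 [r6] = rX,4

    # Prolog: no rotation between stages, so register numbers are pre-adjusted.
    r[35] = next(loads)               # cycle 0
    r[34] = next(loads)               # cycle 1
    r[33] = next(loads)               # cycle 2
    r[36] = r[35] + c                 # cycle 2

    lc, ec = n - 4, 1                 # assumption: generalizes lc = 196, ec = 1
    while True:
        # Kernel: one load, one add, and one store per iteration.
        r[32] = next(loads)
        r[35] = r[34] + c
        out.append(r[36])
        # Simplified br.ctop for EC = 1: rotate on every execution,
        # branch back while LC is nonzero, otherwise fall through.
        taken = lc > 0
        if taken:
            lc -= 1
        else:
            ec -= 1
        rotate(r)
        if not taken:
            break

    # Epilog: drain the two remaining adds and three remaining stores.
    r[35] = r[34] + c                 # cycle 0
    out.append(r[36])                 # cycle 0
    r[34] = r[33] + c                 # cycle 1
    out.append(r[35])                 # cycle 1
    out.append(r[34])                 # cycle 2
    return out


if __name__ == "__main__":
    data = list(range(200))
    assert peeled_pipeline(data, 7) == original_loop(data, 7)
```

With four elements the kernel body runs exactly once, which is consistent with the
minimum trip count of four stated in the Note. Shorter inputs take the fallback path.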