Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3 Manual page 207

The following is a possible pipeline with an II of 2:

stage 1:  ld4  r4 = [r5],4        // Cycle 0
          ld4  r7 = [r8],4;;      // Cycle 0
          ---                     // empty cycle
stage 2:  ---                     // empty cycle
          st4  [r6] = r4,4        // Cycle 3
          st4  [r9] = r7,4;;      // Cycle 3

In the source loop, one iteration is completed every three cycles. In the software
pipelined loop, it takes four cycles to complete the first iteration. Thereafter, iterations
are completed every two cycles. If the trip count is two, the execution time of both
versions of the loop is the same, six cycles. If the average trip count of the loop is less
than two, the software pipelined version of the loop is slower than the source loop.
In addition, it may not be desirable to pipeline a floating-point loop that contains a
function call. The number of floating-point registers used by the loop is not known until
after the loop is pipelined. After pipelining, it may be difficult to find empty slots for the
instructions needed to save and restore the caller-saved floating-point registers across
the function call.

5.5.5 Software Pipelining and Advanced Loads

Advanced loads allow some code that is likely to be invariant to be removed from loops,
thus reducing the resource requirements of the loop. Use of advanced loads also can
reduce the critical path through the iterations, allowing a smaller II to be achieved. See
Chapter 3, "Memory Reference" for more information on advanced loads. However,
caution must be exercised when using advanced loads with register rotation. For this
discussion, we assume an ALAT with 32 entries.

5.5.5.1 Capacity Limitations

An advanced load with a destination that is a rotating register targets a different
physical register and allocates a new ALAT entry for each kernel iteration. For
example, the simple loop below replaces 32 ALAT entries in 32 iterations:

L1: (p16) ld4.a r32 = [r8]
    (p47) ld4.c r63 = [r8]
          br.ctop L1;;

To avoid unnecessary ALAT misses, the check load or advanced load check must be
executed before a later advanced load causes a replacement of the entry being
checked. In the simple loop above, the unnecessary ALAT misses do not occur because
the check load is done within 31 iterations of the advanced load. In the example below,
an ALAT miss is encountered for every check load because the advanced load replaces
an entry just before the corresponding check load is executed:

L1: (p16) ld4.a r32 = [r8]
    (p48) ld4.c r64 = [r8]
          br.ctop L1;;
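
The capacity argument in Section 5.5.5.1 can be illustrated with a small simulation. The Python sketch below is not part of the manual and does not model a real processor. It assumes a 32-entry ALAT that replaces its oldest entry when full (the actual replacement policy is implementation specific), assumes each in-flight source iteration occupies a distinct physical register, ignores stores that would invalidate entries, and counts only kernel iterations in which the advanced load is active. The function name simulate_alat and its parameters are hypothetical.

```python
from collections import OrderedDict


def simulate_alat(check_distance, iterations, alat_entries=32):
    """Count check-load hits and misses for a rotating-register loop.

    check_distance: number of kernel iterations between the ld4.a and the
                    matching ld4.c (31 for p16/p47, 32 for p16/p48).
    iterations:     number of kernel iterations in which ld4.a executes.
    alat_entries:   assumed ALAT capacity (32 in the manual's discussion).
    """
    alat = OrderedDict()  # keys are source iteration numbers, oldest first
    hits = 0
    misses = 0
    for k in range(iterations):
        # ld4.a for source iteration k: allocate a new entry, evicting the
        # oldest one if the ALAT is full (assumed FIFO replacement).
        if len(alat) >= alat_entries:
            alat.popitem(last=False)
        alat[k] = True

        # ld4.c for the source iteration that started check_distance
        # kernel iterations earlier; it follows ld4.a in program order.
        checked = k - check_distance
        if checked >= 0:
            if checked in alat:
                hits += 1
            else:
                misses += 1
    return hits, misses


print(simulate_alat(31, 100))  # first loop (p47 check):  (69, 0)
print(simulate_alat(32, 100))  # second loop (p48 check): (0, 68)
```

Under these assumptions, a check that trails the advanced load by 31 iterations always finds its entry, while a check that trails by 32 iterations always misses, because the entry was evicted by the advanced load issued in the same kernel iteration.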
Volume 1, Part 2: Software Pipelining and Loop Support
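
The trip-count comparison made earlier on this page (three cycles per iteration for the source loop, four cycles for the first software pipelined iteration and two cycles for each later one) can be checked with a short calculation. The Python sketch below simply encodes those numbers from the text. The function names and default arguments are illustrative, and the model ignores any loop setup overhead that a real pipelined loop would add.

```python
def source_loop_cycles(trip_count, cycles_per_iteration=3):
    """Cycles for the source loop: one iteration every three cycles."""
    return cycles_per_iteration * trip_count


def pipelined_loop_cycles(trip_count, first_iteration_cycles=4, ii=2):
    """Cycles for the pipelined loop: four cycles for the first iteration,
    then one iteration every II cycles."""
    if trip_count == 0:
        return 0
    return first_iteration_cycles + ii * (trip_count - 1)


for n in range(1, 5):
    print(n, source_loop_cycles(n), pipelined_loop_cycles(n))
# 1 3 4    pipelined version is slower
# 2 6 6    break-even
# 3 9 8    pipelined version is faster
# 4 12 10
```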