Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3 Manual page 213

Note that, in the code above, the ld4 and the add instructions in stage 2 have been
reordered. Register rotation has been used to eliminate the WAR register dependency
from the add to the ld4. The first two stages are speculative. The code to implement
the pipeline is shown below:
        ld4     r36 = [r5]
        mov     ec = 2
        mov     pr.rot = 1 << 16 ;;      // PR16 = 1, rest = 0
L1:     ld4.s   r32 = [r8],4             // Cycle 0
        ld4.s   r34 = [r9],4             // Cycle 0
(p18)   and     r40 = 3,r39 ;;           // Cycle 0
        ld4.s   r36 = [r35]              // Cycle 1
        add     r38 = r37,r33            // Cycle 1
(p18)   chk.s   r40, recovery            // Cycle 1
(p18)   cmp.ne  p17,p0 = r40,r11         // Cycle 1
(p17)   br.wtop L1 ;;                    // Cycle 1

The problem with this pipelined loop is that the value written to r36 prior to the loop is
overwritten before it is used by the add. The value is overwritten by the load into r36
in the first kernel iteration. This load is in the second stage of the pipeline, but cannot
be controlled during the first kernel iteration because it is speculative and does not
have a stage predicate. This problem can be solved by peeling off one iteration of the
kernel and excluding from that copy any instructions that are not in the first stage of
the pipeline as shown below.
Note that the destination register numbers for the instructions in the explicit prolog
have been increased by one. This is to account for the fact that there is no rotation at
the end of the peeled kernel iteration.

        ld4     r37 = [r5]
        mov     ec = 1
        mov     pr.rot = 1<<17;;         // PR17 = 1, rest = 0
        ld4     r33 = [r8],4
        ld4     r35 = [r9],4
L1:     ld4.s   r32 = [r8],4             // Cycle 0
        ld4.s   r34 = [r9],4             // Cycle 0
(p18)   and     r40 = 3,r39 ;;           // Cycle 0
        ld4.s   r36 = [r35]              // Cycle 1
        add     r38 = r37,r33            // Cycle 1
(p18)   chk.s   r40, recovery            // Cycle 1
(p18)   cmp.ne  p17,p0 = r40,r11         // Cycle 1
(p17)   br.wtop L1 ;;                    // Cycle 1

In some cases, higher performance can be achieved by generating separate blocks of
code for all or part of the prolog and/or epilog phase. It is clear from the execution
trace of the pipelined counted loop from page 1:188 that the functional units are
Volume 1, Part 2: Software Pipelining and Loop Support
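
The overwrite of r36 described above, and the way peeling one kernel iteration avoids it, can be checked with a small register-rotation model. The sketch below is a toy Python simulation, not Itanium code: the class RotatingRegs, the helper functions, and the memory contents in MEM are all hypothetical, and predicates, EC, and chk.s recovery are ignored. It assumes only that each rotation makes the value written to rN readable as rN+1, and it uses None to stand for an undefined or deferred speculative result.

```python
class RotatingRegs:
    """Toy model of rotating general registers r32 and up (hypothetical helper)."""

    def __init__(self, size=16):
        self.phys = [None] * size  # None stands for an undefined or deferred value
        self.size = size
        self.rrb = 0               # rotating register base

    def _index(self, reg):
        return (reg - 32 + self.rrb) % self.size

    def __getitem__(self, reg):
        return self.phys[self._index(reg)]

    def __setitem__(self, reg, value):
        self.phys[self._index(reg)] = value

    def rotate(self):
        # After rotation, the value written to rN is read back as rN+1.
        self.rrb = (self.rrb - 1) % self.size


# Made-up memory image: 100.. is the array read through r8, 200.. holds the
# pointers read through r9, 300.. holds the pointed-to words, 400 is [r5].
MEM = {100: 1, 104: 2, 200: 300, 204: 304, 300: 10, 304: 20, 400: 1000}


def spec_load(addr):
    """Model ld4.s: an unknown address gives None instead of faulting."""
    return MEM.get(addr)


def add(a, b):
    """An undefined operand makes the sum undefined as well."""
    return None if a is None or b is None else a + b


def kernel_iteration(r, ptrs):
    """One pass through the kernel's loads and add, then the rotation."""
    r[32] = spec_load(ptrs["r8"]); ptrs["r8"] += 4   # ld4.s r32 = [r8],4
    r[34] = spec_load(ptrs["r9"]); ptrs["r9"] += 4   # ld4.s r34 = [r9],4
    r[36] = spec_load(r[35])                         # ld4.s r36 = [r35]
    r[38] = add(r[37], r[33])                        # add   r38 = r37,r33
    result = r[38]
    r.rotate()                                       # rotation done by br.wtop
    return result


def naive_first_add():
    """Preload r36, then let the first kernel iteration overwrite it."""
    r, ptrs = RotatingRegs(), {"r8": 100, "r9": 200}
    r[36] = MEM[400]                 # ld4 r36 = [r5]
    kernel_iteration(r, ptrs)        # only the stage 1 loads are meaningful here
    return kernel_iteration(r, ptrs) # first add that has real stage 1 inputs


def peeled_first_add():
    """Explicit prolog: stage 1 loads peeled, destinations increased by one."""
    r, ptrs = RotatingRegs(), {"r8": 100, "r9": 200}
    r[37] = MEM[400]                 # ld4 r37 = [r5]
    r[33] = MEM[ptrs["r8"]]; ptrs["r8"] += 4   # ld4 r33 = [r8],4
    r[35] = MEM[ptrs["r9"]]; ptrs["r9"] += 4   # ld4 r35 = [r9],4
    return kernel_iteration(r, ptrs)


print(naive_first_add())   # None: the preloaded value was overwritten
print(peeled_first_add())  # 1001: the preloaded 1000 plus the first array word
```

In the naive variant the preloaded word is replaced by the first kernel iteration's load into r36, so the first meaningful add reads an undefined r37. In the peeled variant the preload goes to r37 directly and survives, which is the effect the explicit prolog is meant to achieve.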
