Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3 Manual page 212


This loop maintains five independent sums in registers f33-f37. The fma instruction in
iteration X produces a result that is used by the fma instruction in iteration X+5.
Iterations X through X+4 are independent, allowing an II of one to be achieved. The
code for a pipelined version of the loop assuming two memory ports and a nine cycle
latency for a floating-point load is shown below:

               mov     lc = 199              // LC = loop count - 1
               mov     ec = 10               // EC = epilog stages + 1
               mov     pr.rot=1<<16          // PR16 = 1, rest = 0
               mov     f33 = 0               // initialize sums
               mov     f34 = 0
               mov     f35 = 0
               mov     f36 = 0
               mov     f37 = 0
    L1: (p16)  ldfs    f50 = [r5],4          // Cycle 0
        (p16)  ldfs    f60 = [r8],4          // Cycle 0
        (p25)  fma     f41 = f59,f69,f46     // Cycle 0
               br.ctop.sptk L1;;             // Cycle 0
               fadd    f10 = f42,f43         // add sums
               fadd    f11 = f44,f45 ;;
               fadd    f12 = f10,f11 ;;
               fadd    f7 = f12,f46

5.5.8 Explicit Prolog and Epilog

In some cases, an explicit prolog is necessary for code correctness. This can occur in
cases where a speculative instruction generates a value that is live across source
iterations. Consider the following loop:

               ld4     r3 = [r5] ;;
    L1:        ld4     r6 = [r8],4           // Cycle 0
               ld4     r5 = [r9],4 ;;        // Cycle 0
               add     r7 = r3,r6 ;;         // Cycle 2
               ld4     r3 = [r5]             // Cycle 3
               and     r10 = 3,r7;;          // Cycle 3
               cmp.ne  p1,p0=r10,r11         // Cycle 4
        (p1)   br.cond L1 ;;                 // Cycle 4

The following is a possible pipeline for the loop:

    stage 1:        ld4.s   r6 = [r8],4           // II = 2
                    ld4.s   r5 = [r9],4 ;;
                    ---                           // empty cycle
    stage 2:        ---                           // empty cycle
                    ld4.s   r36 = [r5]
                    add     r7 = r37,r6 ;;
    stage 3: (p18)  and     r10 = 3,r7 ;;
             (p18)  cmp.ne  p1,p0 = r10,r11
             (p1)   br.wtop L1 ;;

Volume 1, Part 2: Software Pipelining and Loop Support 1:201
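
The reduction described at the top of this page can be modeled outside of assembly. The following is a minimal Python sketch, not a cycle-accurate model of the hardware. It assumes the loop computes a sum of products (which the fma of two loaded values suggests), and the names partial_sum_dot, combine, and loop_counts are hypothetical helpers introduced only for illustration. It shows that iteration X reads only the sum written by iteration X-5, that the five partial sums are added together after the loop, and how the LC and EC values in the listing follow from the comments "LC = loop count - 1" and "EC = epilog stages + 1" when the stage predicates run from p16 to p25.

```python
def partial_sum_dot(x, y, lanes=5):
    """Accumulate x[i] * y[i] into `lanes` independent partial sums."""
    sums = [0.0] * lanes  # plays the role of the five sums in f33-f37
    for i, (a, b) in enumerate(zip(x, y)):
        # Iteration i reads only the sum written by iteration i - lanes,
        # so any `lanes` consecutive iterations are independent.
        lane = i % lanes
        sums[lane] = a * b + sums[lane]  # one multiply-add, like fma
    return sums


def combine(sums):
    """Add five partial sums pairwise, mirroring the fadd sequence after the loop."""
    s0, s1, s2, s3, s4 = sums
    return ((s0 + s1) + (s2 + s3)) + s4


def loop_counts(trip_count, first_stage_pred=16, last_stage_pred=25):
    """Return (LC, EC) for a counted loop whose stages use p16..p25 (assumed)."""
    stages = last_stage_pred - first_stage_pred + 1  # p16..p25 is 10 stages
    lc = trip_count - 1                               # LC = loop count - 1
    ec = (stages - 1) + 1                             # EC = epilog stages + 1
    return lc, ec
```

For example, combine(partial_sum_dot(x, y)) gives the sum of products of two equal-length lists, and loop_counts(200) reproduces the lc = 199 and ec = 10 values used in the listing. Because floating-point addition is not associative, splitting the sum into five lanes can round differently from a strictly serial accumulation.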
