Intel Itanium Architecture Software Developer's Manual, Volume 1, Revision 2.3 (manual page 220)

If we suppose the minimum floating-point load latency is 9 clocks, and 2 memory
operations can be issued per clock, the above loop has to be unrolled by at least six if
there is no register rotation.
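The unroll factor of six quoted above can be checked with a little arithmetic. The sketch below is an illustration only, not part of the manual. It assumes that each source iteration issues three memory operations (two loads and one store, matching the ldf/ldf/stf pattern of the listings in this section), that memory issue slots are the only resource limit, and that without rotation the loop body must span at least the load latency so that a loaded value is consumed before the same register is reloaded. The function names are hypothetical.

```python
import math
from fractions import Fraction


def body_clocks(unroll, mem_ops_per_iter=3, mem_ops_per_clock=2):
    """Fewest clocks needed to issue all memory operations of an unrolled body.

    Assumption: memory issue slots are the only bottleneck.
    """
    return math.ceil(Fraction(unroll * mem_ops_per_iter, mem_ops_per_clock))


def clocks_per_iteration(unroll, mem_ops_per_iter=3, mem_ops_per_clock=2):
    """Issue-limited clocks per source iteration for a given unroll factor."""
    return Fraction(body_clocks(unroll, mem_ops_per_iter, mem_ops_per_clock), unroll)


def min_unroll_without_rotation(load_latency, mem_ops_per_iter=3, mem_ops_per_clock=2):
    """Smallest unroll factor whose issue-limited body covers the load latency.

    Assumption: with static registers, each load target stays live for the
    whole load latency, so the body cannot be shorter than that latency.
    """
    unroll = 1
    while body_clocks(unroll, mem_ops_per_iter, mem_ops_per_clock) < load_latency:
        unroll += 1
    return unroll


if __name__ == "__main__":
    u = min_unroll_without_rotation(load_latency=9)
    print("minimum unroll factor:", u)
    print("body length in clocks:", body_clocks(u))
    print("clocks per iteration:", clocks_per_iteration(u))
```

Under the same assumptions, clocks_per_iteration(1) and clocks_per_iteration(2) evaluate to 2 and 3/2, which agree with the initiation intervals this section quotes for the register-rotation version of the loop.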
        add     r8 = r7, 8
L1:
  (p18) stf     [r7] = f25, 16      // Cycle 17,26...
  (p18) stf     [r8] = f26, 16      // Cycle 17,26...
  (p17) fadd    f25 = f5, f15       // Cycle 8,17,26...
  (p16) ldf     f5 = [r5], 8        // Cycle 0,9,18...
  (p16) ldf     f15 = [r6], 8       // Cycle 0,9,18...
  (p17) fadd    f26 = f6, f16 ;;    // Cycle 9,18,27...
  (p16) ldf     f6 = [r5], 8        // Cycle 1,10,19...
  (p16) ldf     f16 = [r6], 8       // Cycle 1,10,19...
  (p18) stf     [r7] = f27, 16      // Cycle 20,29...
  (p18) stf     [r8] = f28, 16      // Cycle 20,29...
  (p17) fadd    f27 = f7, f17 ;;    // Cycle 11,20...
  (p16) ldf     f7 = [r5], 8        // Cycle 3,12,21...
  (p16) ldf     f17 = [r6], 8       // Cycle 3,12,21...
  (p17) fadd    f28 = f8, f18 ;;    // Cycle 12,21...
  (p16) ldf     f8 = [r5], 8        // Cycle 4,13,22...
  (p16) ldf     f18 = [r6], 8       // Cycle 4,13,22...
  (p18) stf     [r7] = f29, 16      // Cycle 23,32...
  (p18) stf     [r8] = f30, 16      // Cycle 23,32...
  (p17) fadd    f29 = f9, f19 ;;    // Cycle 14,23...
  (p16) ldf     f9 = [r5], 8        // Cycle 6,15,24...
  (p16) ldf     f19 = [r6], 8       // Cycle 6,15,24...
  (p17) fadd    f30 = f10, f20 ;;   // Cycle 15,24...
  (p16) ldf     f10 = [r5], 8       // Cycle 7,16,25...
  (p16) ldf     f20 = [r6], 8       // Cycle 7,16,25...
        br.ctop L1 ;;

However, with register rotation, the same loop can be scheduled with an initiation
interval of just 2 clocks without unrolling (and 1.5 clocks if unrolled by 2):

L1:
  (p24) stf     [r7] = f57, 8       // Cycle 15,17...
  (p21) fadd    f57 = f37, f47      // Cycle 9,11,13...
  (p16) ldf     f32 = [r5], 8       // Cycle 0,2,4,6...
  (p16) ldf     f42 = [r6], 8       // Cycle 0,2,4,6...
        br.ctop L1 ;;

It is thus often advantageous to modulo schedule and then unroll (if required). Please
see Chapter 5, "Software Pipelining and Loop Support" for details on how to rewrite
loops using this transformation.

6.3.1.1 Notes on FP Precision

The floating-point registers are 82 bits wide, with 17 bits for exponent range, 64 bits for
significand precision, and 1 sign bit. During computation, the result range and precision
are determined by the computational model chosen by the user. The computational
model is indicated either statically in the instruction encoding, or dynamically via the
precision control (PC) and widest-range-exponent (WRE) bits in the floating-point
status register. Using an appropriate computational model, the user can minimize the
error accumulation in the computation. In the above matrix multiply example, if the
multiply and add computations are performed in full register file range and precision,
the results (in accumulators) can hold 64 bits of precision and up to 17 bits of range for

Volume 1, Part 2: Floating-point Applications