Intel Itanium Architecture Software Developer's Manual, Volume 1, Rev. 2.3, page 221
inputs that might be single precision numbers. With the rounding performed at the 64th
precision bit (instead of the 24th for single precision), a smaller error is accumulated
with each multiply and add. Furthermore, with 17 bits of range (instead of 8 bits for
single precision), large positive and negative products can be added to the accumulator
without overflow or underflow. In addition to providing more accurate results, the extra
range and precision can often enhance the performance of iterative computations that
must be performed until convergence (as indicated by an error bound) is reached.
6.3.2  Multiply-Add Instruction
The Itanium architecture defines the fused multiply-add (fma) as the basic
floating-point computation, since it forms the core of many computations (linear
algebra, series expansion, etc.) and its latency in hardware is typically less than the
sum of the latencies of an individual multiply operation (with rounding) and an
individual add operation (with rounding).
In computational loops that have a loop-carried dependency, where speed is
determined by the latency of the floating-point computation rather than by the peak
computational rate, the multiply-add operation can often be used advantageously.
Consider the Livermore FORTRAN Kernel 9, General Linear Recurrence Equations:
      DO 191 k = 1,n
          B5(k+KB5I) = SA(k) + STB5 * SB(k)
          STB5 = B5(k+KB5I) - STB5
  191 CONTINUE
Since there is a true data dependency between the two statements on variable
B5(k+KB5I) and a loop-carried dependency on variable STB5, the number of
clocks per loop iteration is entirely determined by the latency of the floating-point
operations. In the absence of an fma-type operation, and assuming that the individual
multiply and add latencies are 5 clocks each and the loads are 8 cycles, the loop would
be:
L1:
(p16)  ldf     f32 = [r5], 8          // Load SA(k)
(p16)  ldf     f42 = [r6], 8          // Load SB(k)
(p17)  fmul    f5 = f7, f43 ;;        // tmp,Clk 0,15 ...
(p17)  fadd    f6 = f33, f5 ;;        // B5,Clk 5,20 ...
(p17)  stf     [r7] = f6, 8           // Store B5
(p17)  fsub    f7 = f6, f7            // STB5,Clk 10,25 ..
       br.ctop L1 ;;
With an fma, the overall latency of the chain of operations decreases; assuming a
5-cycle fma, the loop iteration speed is now 10 clocks (as opposed to 15 clocks above).
L1:
(p16)  ldf     f32 = [r5], 8          // Load SA(k)
(p16)  ldf     f42 = [r6], 8          // Load SB(k)
(p17)  fma     f6 = f7, f43, f33 ;;   // B5,Clk 0,10 ...
(p17)  stf     [r7] = f6, 8           // Store B5
(p17)  fsub    f7 = f6, f7            // STB5,Clk 5,15 ..
       br.ctop L1 ;;
The fused multiply-add operation also offers the advantage of a single rounding error
for the pair of computations, which is valuable when trying to compute small differences
of large numbers.
Volume 1, Part 2: Floating-point Applications
