Floating-Point Applications - Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3 Manual

6
Floating-point Applications

6.1
Overview
The Itanium floating-point architecture is fully compliant with the ANSI/IEEE-754
standard and provides performance-enhancing features such as the fused multiply
accumulate instruction, the large floating-point register file (with static and rotating
sections), the extended-range register file data representation, the multiple independent
floating-point status fields, and the high-bandwidth memory access instructions. Together
these enable the creation of compact, high-performance floating-point application code.
The beginning of this chapter reviews some specific performance limitations that are
common in floating-point intensive application codes. Later, architectural features that
address these limitations are presented with illustrative code examples. The remainder
of this chapter highlights the optimization of some commonly used kernels using these
features.
6.2
FP Application Performance Limiters
Floating-point applications are characterized by a predominance of loops. Some loops
compute complex calculations on regularly structured data, others simply copy data
from one place to another, while others perform gather/scatter-type operations that
simultaneously compute and rearrange data. The following sections describe code
characteristics that limit performance and how they affect these different kinds of
loops.
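The three kinds of loops described above can be sketched in C as follows. This is only an illustrative sketch: the function names (`scale_add`, `copy_block`, `gather_scale_scatter`) and the index-array layout are assumptions made for this example and do not come from the manual.

```c
#include <stddef.h>

/* Compute loop: arithmetic on regularly structured (unit-stride) data. */
void scale_add(double *y, const double *x, double a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Copy loop: moves data from one place to another without computing on it. */
void copy_block(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Gather/scatter loop: computes while rearranging data.
   src_idx and dst_idx are hypothetical index arrays chosen by the caller. */
void gather_scale_scatter(double *dst, const size_t *dst_idx,
                          const double *src, const size_t *src_idx,
                          double a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[dst_idx[i]] = a * src[src_idx[i]];
}
```

The first loop is typically bound by arithmetic throughput or latency, the second by memory bandwidth, and the third by both plus the extra loads needed for the index arrays.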
6.2.1
Execution Latency
Loops often contain recurrence relationships. Consider the tri-diagonal elimination
kernel from the Livermore Fortran Kernel suite.
DO 5 i = 2, N
5 X[i] = Z[i] * (Y[i] - X[i-1])
The dependency between X[i] and X[i-1] limits the iteration time of the loop to be
the sum of the latency of the subtract and the multiply. The available parallelism can be
increased by unrolling the loop and can be exploited by replicating computation;
however, the fundamental limitation of the data dependency remains.
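As a rough illustration of the recurrence and of what unrolling with replicated computation might look like, consider the C sketch below. It assumes 0-based arrays with `x[0]` holding the starting value, and the function names are hypothetical. The unrolled form is one possible algebraic rearrangement, not necessarily the transformation the manual has in mind, and because it reassociates floating-point operations its results can differ from the serial loop in the last bits.

```c
#include <stddef.h>

/* Direct translation of the kernel. Each iteration must wait for the
   previous x value: one subtract followed by one multiply. */
void tridiag_serial(double *x, const double *y, const double *z, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}

/* Unrolled by two. Substituting x[i] into the expression for x[i+1] gives
     x[i+1] = z[i+1]*(y[i+1] - z[i]*y[i]) + (z[i+1]*z[i]) * x[i-1]
   so the terms a and b below do not depend on x and can be computed ahead. */
void tridiag_unrolled2(double *x, const double *y, const double *z, size_t n)
{
    size_t i = 1;
    for (; i + 1 < n; i += 2) {
        double a = z[i + 1] * (y[i + 1] - z[i] * y[i]);  /* off the recurrence */
        double b = z[i + 1] * z[i];                      /* off the recurrence */
        double prev = x[i - 1];
        /* Both results depend only on prev, so they can execute in parallel. */
        x[i]     = z[i] * (y[i] - prev);
        x[i + 1] = a + b * prev;
    }
    if (i < n)  /* leftover iteration when the trip count is odd */
        x[i] = z[i] * (y[i] - x[i - 1]);
}
```

In the unrolled version, the chain from one `prev` to the next is a single multiply and add for every two elements, at the cost of extra multiplies per pair. The dependency on the previous result is shortened but not removed, which matches the point made above.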
Sometimes, even if the loop is vectorizable and can be software pipelined, the iteration
time of the loop is limited by the execution latency of the hardware that executes the
code. A simple vector divide (shown below) is a typical example:
DO 1 i = 1, N
1 X[i] = Y[i] / Z[i]
Since the floating-point divide hardware in typical modern microprocessors is not
pipelined, the iteration time of the loop is the latency of the divide, which can be tens
of clocks.
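The effect of a non-pipelined divide on loop time can be estimated with a very simple cycle model, sketched here in C. The model and the function names are assumptions made for illustration: it ignores memory accesses and loop overhead, and any latency value passed in is hypothetical rather than a figure from the manual.

```c
#include <stddef.h>

/* The vector divide loop itself, with 0-based arrays. */
void vector_divide(double *x, const double *y, const double *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] / z[i];
}

/* Non-pipelined unit: a new divide cannot start until the previous one
   finishes, so every iteration costs the full latency. */
unsigned long cycles_nonpipelined(unsigned long n, unsigned long latency)
{
    return n * latency;
}

/* Fully pipelined unit (assumed to accept one new operation per cycle):
   the latency is paid once, then one result completes per cycle. */
unsigned long cycles_pipelined(unsigned long n, unsigned long latency)
{
    return n == 0 ? 0 : latency + (n - 1);
}
```

For example, with an assumed 30-cycle divide and 1000 elements, the model gives 30000 cycles for the non-pipelined case and 1029 cycles for the pipelined case, which shows why the divide latency dominates this loop even though its iterations are independent.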
Volume 1, Part 2: Floating-point Applications
