Memory Access Control - Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3 Manual

architecture provides instructions that allow moving floating-point fields between the
integer and floating-point register files. Division of a floating-point number by 2.0 is
accomplished as follows:
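    getf.exp   r5 = f5       // Move S+Exp to int
    add        r5 = r5, -1   // Sub 1 from Exp
    setf.exp   f6 = r5       // Move S+Exp to FP
    fmerge.se  f5 = f6, f5   // Merge S+E w/ Mant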
Floating-point values can also be constructed from fields from different floating-point
registers.
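
The exponent-decrement idea behind the sequence above can be sketched in portable C. This is only an analogy, not the Itanium register-format operation: it works on the 64-bit IEEE double memory format (11-bit exponent field) rather than the floating-point register format, and the function name halve_by_exponent is hypothetical. It assumes a normal, finite input whose result is also normal, and falls back to an ordinary division otherwise.

```c
#include <stdint.h>
#include <string.h>

/* Halve a double by subtracting 1 from its biased exponent field.
 * Assumption: IEEE 754 binary64 layout (1 sign, 11 exponent, 52 mantissa bits). */
double halve_by_exponent(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            /* move sign+exponent+mantissa to integer */

    uint64_t exp = (bits >> 52) & 0x7FF;       /* extract the biased exponent */
    if (exp <= 1 || exp == 0x7FF)
        return x / 2.0;                        /* zero, denormal, infinity, NaN: fall back */

    bits -= (uint64_t)1 << 52;                 /* subtract 1 from the exponent */
    memcpy(&x, &bits, sizeof x);               /* move the result back to floating point */
    return x;
}
```

The sign and mantissa bits are untouched, which mirrors the role of fmerge.se in recombining the modified sign and exponent with the original significand.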
6.3.7 Memory Access Control

Recognizing the trend of growing memory access latency and the implementation cost
of high bandwidth, the Itanium architecture incorporates many architectural features to
help manage the memory hierarchy and increase performance. As described in
Section 6.2, memory latency and bandwidth are significant performance limiters in
floating-point applications. The architecture offers features to address both
limitations.
To enhance the core bandwidth to the floating-point register file, the
architecture defines load-pair instructions. To mitigate memory latency, it defines
explicit and implicit data prefetch instructions. To maximize the utilization of
caches, it defines locality attributes as part of memory access instructions to help
control the allocation (and de-allocation) of data in the caches. For cases where
instruction bandwidth may become a performance limiter, the architecture defines
machine hints to trigger relevant instruction prefetches.
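
As a rough illustration of an explicit data prefetch carrying a locality attribute, the sketch below uses the GCC/Clang builtin __builtin_prefetch(addr, rw, locality). This is a compiler-level analogy, not the Itanium instruction encoding; the function name sum_streaming and the distance PREFETCH_AHEAD are hypothetical, and a useful distance depends on the machine.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; tune per machine. */
#define PREFETCH_AHEAD 64

/* Sum an array that is read once, hinting that the data has little reuse. */
double sum_streaming(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* rw = 0: prefetch for read; locality = 0: no temporal locality,
         * so the data need not be kept in the caches after use. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
#endif
        sum += a[i];
    }
    return sum;
}
```

The prefetch only affects timing; the result is the same with or without it, which is why the call can be compiled out on other compilers.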
6.3.7.1 Load-pair Instructions
The floating-point load-pair instructions load two contiguous values from
memory into two independent floating-point registers. The two target registers must be
one odd and one even physical register so that the machine can use just one access
port to accomplish the register update.
Note: The odd/even pair restriction is on physical register numbers, not logical
register numbers. A programming violation of this rule will cause an illegal operation
fault.
For example, suppose a machine that can issue 2 FP instructions per cycle provides
sufficient bandwidth from the second-level cache (L2) to sustain 2 load-pairs every
cycle. Then loops that require up to 2 data elements (of 8 bytes each) per floating-point
instruction can run at peak speed when the data is resident in L2. A common example
of such a case is a simple double-precision dot product (DDOT):
      DO 1 I = 1, N
    1 C = C + A(I) * B(I)
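
The bandwidth arithmetic in this example can be made concrete with a C sketch of the same dot product, unrolled by two so that each iteration consumes one contiguous pair from A and one from B, which is the access pattern a load-pair serves. The function name ddot_pairs and the use of two partial sums are illustrative assumptions; two partial sums change the summation order relative to the Fortran loop, so results can differ in the last bits for general inputs.

```c
#include <stddef.h>

/* Dot product consuming A and B two contiguous elements at a time.
 * Per iteration: 2 multiply-adds and 4 eight-byte operands, i.e. 2 data
 * elements per floating-point instruction, supplied by 2 pair loads. */
double ddot_pairs(const double *a, const double *b, size_t n)
{
    double c0 = 0.0, c1 = 0.0;   /* two independent accumulation chains */
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        c0 += a[i]     * b[i];       /* first element of each pair */
        c1 += a[i + 1] * b[i + 1];   /* second element of each pair */
    }
    if (i < n)                       /* leftover element when n is odd */
        c0 += a[i] * b[i];

    return c0 + c1;
}
```

On a machine matching the example (2 FP instructions and 2 load-pairs per cycle), one such iteration corresponds to one cycle of work when the data is resident in L2.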
Volume 1, Part 2: Floating-point Applications
