Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3 Manual page 1775

The MOVNTPS (Non-temporal store of packed single-precision floating-point)
instruction stores data from an SSE register to memory. The memory address must be
aligned to a 16-byte boundary; if it is not aligned, a general protection exception will
occur. The instruction is implicitly weakly ordered, does not write-allocate, and
minimizes cache pollution.
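As an illustration of the alignment requirement, the following is a minimal sketch in C using the `_mm_stream_ps` intrinsic, which compilers typically lower to MOVNTPS. The helper name `fill_stream_ps` and its preconditions (16-byte-aligned destination, element count a multiple of 4) are assumptions made for this example and do not come from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: fill `count` floats at `dst` with `value` using
 * non-temporal stores.
 * Assumptions: `dst` is 16-byte aligned and `count` is a multiple of 4. */
void fill_stream_ps(float *dst, size_t count, float value)
{
    /* A misaligned address would raise a general protection exception,
     * so the precondition is checked up front. */
    assert(((uintptr_t)dst & 15u) == 0);
    assert(count % 4 == 0);

    __m128 v = _mm_set1_ps(value);      /* broadcast value to all four lanes */
    for (size_t i = 0; i < count; i += 4) {
        _mm_stream_ps(dst + i, v);      /* non-temporal 16-byte store */
    }

    /* The stores are weakly ordered; fence before other agents may rely
     * on the data. */
    _mm_sfence();
}
```

A caller could obtain a suitably aligned buffer with `_mm_malloc(n * sizeof(float), 16)` and release it with `_mm_free`.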
The main difference between a non-temporal store and a regular cacheable store is in
the write-allocation policy. The memory type of the region being written to can override
the non-temporal hint, leading to the following considerations:
• If the programmer specifies a non-temporal store to uncacheable memory, then the
store behaves like an uncacheable store; the non-temporal hint is ignored and the
memory type for the region is retained. Uncacheable as referred to here means that
the region being written to has been mapped with either a UC or WP memory type.
If the memory region has been mapped as WB, WT or WC, the non-temporal store
will implement weakly-ordered (WC) semantic behavior.
• If the programmer specifies a non-temporal store to cacheable memory, two cases
may result:
    • If the data is present in the cache hierarchy, the instruction will ensure
    consistency. A given processor may choose different ways to implement this;
    some examples include: updating data in-place in the cache hierarchy while
    preserving the memory type semantics assigned to that region, or evicting the
    data from the caches and writing the new non-temporal data to memory (with
    WC semantics).
    • If the data is not present in the cache hierarchy, and the destination region is
    mapped as WB, WT or WC, the transaction will be weakly ordered, and is
    subject to all WC memory semantics. The non-temporal store will not write
    allocate. Different implementations may choose to collapse and combine these
    stores.
• In general, WC semantics require software to ensure coherence with respect to
other processors and other system agents (such as graphics cards). Appropriate
use of synchronization and a fencing operation (see SFENCE, below) must be
performed for producer-consumer usage models. Fencing ensures that all system
agents have global visibility of the stored data; for instance, failure to fence may
result in a written cache line staying within a processor, and the line would not be
visible to other agents. For processors which implement non-temporal stores by
updating data in-place that already resides in the cache hierarchy, the destination
region should also be mapped as WC. Otherwise, if it is mapped as WB or WT, there
is the potential for speculative processor reads to bring the data into the caches; in
this case, non-temporal stores would then update in place, and data would not be
flushed from the processor by a subsequent fencing operation.
• The memory type visible on the bus in the presence of memory type aliasing is
implementation specific. As one possible example, the memory type written to the
bus may reflect the memory type for the first store to this line, as seen in program
order; other alternatives are possible. This behavior should be considered reserved,
and dependency on the behavior of any particular implementation risks future
incompatibility.
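The producer-consumer requirement described in the list above can be sketched as follows. This is a hedged example, not the manual's own code: the `mailbox_t` type, the `produce` and `try_consume` functions, and the use of a C11 atomic flag as the synchronization variable are all assumptions introduced for illustration. The point it shows is the ordering: non-temporal stores, then `_mm_sfence` (SFENCE), then the flag that tells the consumer the data is ready.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

#define PAYLOAD_FLOATS 16

/* Hypothetical shared mailbox; the payload is forced to 16-byte alignment
 * because the non-temporal store requires it. */
typedef struct {
    _Alignas(16) float payload[PAYLOAD_FLOATS];
    atomic_int ready;   /* 0 = empty, 1 = payload published */
} mailbox_t;

/* Producer: write the payload with non-temporal stores, fence, then publish. */
void produce(mailbox_t *box, float value)
{
    assert(((uintptr_t)box->payload & 15u) == 0);

    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < PAYLOAD_FLOATS; i += 4) {
        _mm_stream_ps(&box->payload[i], v);
    }

    /* Without this fence the weakly ordered stores could still be pending
     * in the processor when the flag becomes visible. */
    _mm_sfence();
    atomic_store_explicit(&box->ready, 1, memory_order_release);
}

/* Consumer: copy the payload out only after the flag has been observed.
 * Returns 1 if data was copied into `out`, 0 if nothing was published yet. */
int try_consume(mailbox_t *box, float *out)
{
    if (atomic_load_explicit(&box->ready, memory_order_acquire) == 0) {
        return 0;
    }
    for (size_t i = 0; i < PAYLOAD_FLOATS; i++) {
        out[i] = box->payload[i];
    }
    return 1;
}
```

The fence is placed once after the whole batch of stores rather than after each store, so the processor remains free to combine the writes.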
The PREFETCH instructions (load 32 or more bytes) load either non-temporal or
temporal data into the specified cache level. Both the access type and the cache level
are specified as hints. The prefetch instructions do not affect the functional behavior
of the program, and their effect is implementation specific.
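Because a prefetch is only a hint, adding or removing it should change performance at most, never results. A minimal sketch in C using the `_mm_prefetch` intrinsic with the `_MM_HINT_NTA` (non-temporal) hint follows; the function name `sum_with_prefetch`, the prefetch distance, and the assumed line granularity are illustrative assumptions, not values taken from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_prefetch, _MM_HINT_NTA */

/* Hypothetical helper: sum an array while hinting that upcoming data is
 * non-temporal. The numeric result is the same with or without the hint. */
float sum_with_prefetch(const float *data, size_t count)
{
    assert(data != NULL || count == 0);

    const size_t ahead = 64;    /* prefetch distance in elements (tuning assumption) */
    const size_t stride = 16;   /* issue one hint per 16 floats (assumed 64-byte line) */
    float sum = 0.0f;

    for (size_t i = 0; i < count; i++) {
        if (i % stride == 0 && i + ahead < count) {
            /* Hint only: the processor may ignore it. */
            _mm_prefetch((const char *)(data + i + ahead), _MM_HINT_NTA);
        }
        sum += data[i];
    }
    return sum;
}
```

Replacing `_MM_HINT_NTA` with `_MM_HINT_T0` would request temporal data in the closest cache level instead; which cache levels are actually affected remains implementation specific.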
Volume 4: IA-32 SSE Instruction Reference
4:473
