IA-32 Intel® Architecture Optimization
The examples that follow illustrate coding adjustments that enable the
algorithm to benefit from SSE. The same techniques may be used for
single-precision floating-point, double-precision floating-point, and
integer data under SSE2, SSE, and MMX technology.
As a basis for the usage model discussed in this section, consider the
simple loop shown in Example 3-8.
Example 3-8   Simple Four-Iteration Loop
void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}
Note that the loop runs for only four iterations. This allows a simple
replacement of the code with Streaming SIMD Extensions.
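As a rough illustration of the replacement described above, the four
iterations of Example 3-8 can be collapsed into a single packed add using
the SSE intrinsics declared in xmmintrin.h. This is only a sketch: the
function name add_sse is hypothetical and does not come from the original
text, and the code assumes that all three pointers are 16-byte aligned,
which the aligned load and store intrinsics require.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical name; same signature as add() in Example 3-8. */
void add_sse(float *a, float *b, float *c)
{
    /* Assumption: the caller passes 16-byte aligned pointers.
       _mm_load_ps and _mm_store_ps fault on unaligned addresses. */
    assert(((uintptr_t)a & 15u) == 0);
    assert(((uintptr_t)b & 15u) == 0);
    assert(((uintptr_t)c & 15u) == 0);

    __m128 va = _mm_load_ps(a);           /* load a[0..3] into one register */
    __m128 vb = _mm_load_ps(b);           /* load b[0..3] into one register */
    _mm_store_ps(c, _mm_add_ps(va, vb));  /* c[i] = a[i] + b[i] for i = 0..3 */
}
```

Because one 128-bit register holds exactly four single-precision values,
no loop remains. A longer array would need a loop that advances four
elements per iteration, plus handling for any leftover elements.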
For the optimal use of the Streaming SIMD Extensions that need data
alignment on the 16-byte boundary, all examples in this chapter assume
that the arrays passed to the routine, a, b, c, are aligned to 16-byte
boundaries by a calling routine. For the methods to ensure this
alignment, please refer to the application notes for the Pentium 4
processor.
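The application notes are the authoritative source for alignment methods.
As a hedged sketch of what a calling routine might do, the code below
shows one way to obtain 16-byte aligned storage on the heap and to check
a pointer before passing it on. The helper names is_aligned_16 and
alloc_floats_16 are hypothetical, and the sketch assumes a compiler whose
xmmintrin.h provides _mm_malloc and _mm_free.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* assumed to provide _mm_malloc and _mm_free */

/* Hypothetical helper: nonzero if p lies on a 16-byte boundary. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   Returns NULL on failure. Release the block with _mm_free, not free. */
float *alloc_floats_16(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}
```

For arrays with static or automatic storage, a C11 compiler accepts an
alignment specifier on the declaration instead, for example
_Alignas(16) float a[4];.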
The sections that follow provide details on the coding methodologies:
inlined assembly, intrinsics, C++ vector classes, and automatic
vectorization.