Accelerating Floating-Point Divides And Square Roots - AMD Athlon Processor x86 Optimization Manual


22007E/0—November 1999

Accelerating Floating-Point Divides and Square Roots

quadword alignment), so that quadword operands might be
misaligned, even if this technique is used and the compiler does
allocate variables in the order they are declared.
The following example demonstrates the reordering of local
variable declarations:
Original ordering (Avoid):
    short ga, gu, gi;
    long foo, bar;
    double x, y, z[3];
    char a, b;
    float baz;
Improved ordering (Preferred):
    double z[3];
    double x, y;
    long foo, bar;
    float baz;
    short ga, gu, gi;
    char a, b;
See "Sort Variables According to Base Type Size" on page 56 for
more information from a different perspective.
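The effect of the reordering can be checked with a small C sketch. It assumes 32-bit x86 type sizes (short 2, long 4, float 4, double 8 bytes), a quadword-aligned starting offset, and a compiler that allocates variables back to back in declaration order without padding. None of this is guaranteed in practice. The names struct var and count_misaligned are hypothetical, and the two char variables are placed last in the improved ordering.

```c
#include <assert.h>
#include <stddef.h>

/* One local variable: name, element size in bytes, element count. */
struct var {
    const char *name;
    size_t elem_size;
    size_t count;
};

/*
 * Lay the variables out back to back in declaration order, starting at
 * offset 0 (assumed quadword aligned) and inserting no padding. This is
 * a simplifying assumption; real compilers may pad or reorder.
 * Returns the number of variables whose starting offset is not a
 * multiple of their element size.
 */
size_t count_misaligned(const struct var *vars, size_t n)
{
    size_t offset = 0, misaligned = 0;
    for (size_t i = 0; i < n; i++) {
        assert(vars[i].elem_size > 0);
        if (offset % vars[i].elem_size != 0)
            misaligned++;
        offset += vars[i].elem_size * vars[i].count;
    }
    return misaligned;
}

/* Sizes assume a 32-bit x86 target. */
const struct var original_order[] = {
    {"ga", 2, 1}, {"gu", 2, 1}, {"gi", 2, 1},
    {"foo", 4, 1}, {"bar", 4, 1},
    {"x", 8, 1}, {"y", 8, 1}, {"z", 8, 3},
    {"a", 1, 1}, {"b", 1, 1},
    {"baz", 4, 1},
};

const struct var improved_order[] = {
    {"z", 8, 3}, {"x", 8, 1}, {"y", 8, 1},
    {"foo", 4, 1}, {"bar", 4, 1},
    {"baz", 4, 1},
    {"ga", 2, 1}, {"gu", 2, 1}, {"gi", 2, 1},
    {"a", 1, 1}, {"b", 1, 1},
};
```

Under these assumptions, count_misaligned reports five misaligned variables (foo, bar, x, y, and z) for the original ordering and none for the improved one.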
Divides and square roots have a much longer latency than other
floating-point operations, even though the AMD Athlon
processor provides significant acceleration of these two
operations. In some codes, these operations occur so often as to
seriously impact performance. In these cases, it is
recommended to port the code to 3DNow! inline assembly or to
use a compiler that can generate 3DNow! code. If code has hot
spots that use single-precision arithmetic only (i.e., all
computation involves data of type float) and for some reason
cannot be ported to 3DNow!, the following technique may be
used to improve performance.
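As an illustrative sketch in C of the mechanism involved: the x87 control word holds a two-bit precision-control (PC) field in bits 8 and 9, where 00 selects single precision, 10 double precision, and 11 extended precision. The helper names below are hypothetical. The part that touches the hardware assumes a GCC-compatible compiler on an x86 target and is compiled only there.

```c
#include <stdint.h>

/* x87 control word: the precision-control (PC) field is bits 8-9. */
#define X87_PC_MASK     0x0300u
#define X87_PC_SINGLE   0x0000u  /* 24-bit significand */
#define X87_PC_DOUBLE   0x0200u  /* 53-bit significand */
#define X87_PC_EXTENDED 0x0300u  /* 64-bit significand */

/* Return cw with its PC field replaced by pc; other bits unchanged. */
uint16_t x87_with_precision(uint16_t cw, uint16_t pc)
{
    return (uint16_t)((cw & ~X87_PC_MASK) | (pc & X87_PC_MASK));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read the live control word with FNSTCW. */
uint16_t x87_read_cw(void)
{
    uint16_t cw;
    __asm__ __volatile__("fnstcw %0" : "=m"(cw));
    return cw;
}

/* Load a new control word with FLDCW. */
void x87_write_cw(uint16_t cw)
{
    __asm__ __volatile__("fldcw %0" : : "m"(cw));
}

/*
 * Switch the x87 unit to single precision and return the previous
 * control word so the caller can restore it after the hot spot.
 */
uint16_t x87_enter_single_precision(void)
{
    uint16_t old = x87_read_cw();
    x87_write_cw(x87_with_precision(old, X87_PC_SINGLE));
    return old;
}
#endif
```

A caller would wrap the single-precision hot spot between x87_enter_single_precision() and x87_write_cw(old), so that code expecting the previous precision is unaffected. The setting only influences x87 instructions; where the compiler emits SSE arithmetic for float, changing the control word has no effect on those operations.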
The x87 FPU has a precision-control field as part of the FPU
control word. The precision-control setting determines what
precision results get rounded to. It affects the basic arithmetic
operations, including divides and square roots. AMD Athlon™
and AMD-K6® family processors implement divide and square
root in such fashion as to only compute the number of bits
