Newton-Raphson Reciprocal Square Root; Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel - AMD Athlon Processor x86 Optimization Manual

22007E/0—November 1999

Newton-Raphson Reciprocal Square Root

The general Newton-Raphson reciprocal square root recurrence
is:

$$Z_{i+1} = \frac{1}{2} \cdot Z_i \cdot (3 - b \cdot Z_i^2)$$
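
The recurrence for 1/sqrt(b), Z_{i+1} = 1/2 · Z_i · (3 − b · Z_i²), can be sketched in scalar C as follows. This is only an illustration of the arithmetic, not code from the manual; the function names `rsqrt_step` and `rsqrt_newton` are hypothetical, and double precision is used simply to make the convergence easy to observe.

```c
/* One Newton-Raphson step toward 1/sqrt(b):
 *   z_next = 0.5 * z * (3 - b * z * z)
 * If z has relative error e, z_next has relative error of about -1.5 * e * e,
 * so the number of correct bits roughly doubles per step. */
double rsqrt_step(double b, double z)
{
    return 0.5 * z * (3.0 - b * z * z);
}

/* Apply `steps` iterations starting from the caller's guess z0.
 * Assumption: b > 0, z0 > 0 and b * z0 * z0 < 3, which keeps every
 * iterate positive and makes the sequence converge to 1/sqrt(b). */
double rsqrt_newton(double b, double z0, int steps)
{
    double z = z0;
    for (int i = 0; i < steps; ++i)
        z = rsqrt_step(b, z);
    return z;
}
```

Because the error is roughly squared at each step, a starting value that is good to about 15 bits needs only one step to exceed 24 bits, whereas a crude guess needs several.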
To reduce the number of iterations, the initial approximation is
read from a table. The 3DNow! reciprocal square root
approximation is accurate to at least 15 bits. Accordingly, to
obtain a single-precision 24-bit reciprocal square root of an
input operand b, one Newton-Raphson iteration is required,
using the following sequence of 3DNow! instructions:
X0 = PFRSQRT(b)
X1 = PFMUL(X0, X0)
X2 = PFRSQIT1(b, X1)
X3 = PFRCPIT2(X2, X0)
X4 = PFMUL(b, X3)
The 24-bit final reciprocal square root value is X3. In the
AMD Athlon processor 3DNow! implementation, the estimate
contains the correct round-to-nearest value for approximately
87% of all arguments. The remaining arguments differ from the
correct round-to-nearest value by one unit-in-the-last-place. The
square root (X4) is formed in the last step by multiplying by the
input operand b.
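
The data flow of the five-step sequence (PFRSQRT, PFMUL, PFRSQIT1, PFRCPIT2, PFMUL) can be modeled with ordinary single-precision arithmetic. The sketch below is an assumption-laden model, not the hardware behavior: the roles given to PFRSQIT1 and PFRCPIT2 are inferred from the Newton-Raphson recurrence, the results are not bit-exact with 3DNow!, and the initial estimate X0 is passed in by the caller rather than read from a table. The names `rsqrt_result` and `refine_rsqrt` are hypothetical.

```c
/* Result pair mirroring X3 and X4 of the instruction sequence. */
typedef struct {
    float inv_root; /* X3: refined estimate of 1/sqrt(b) */
    float root;     /* X4: b * X3, an estimate of sqrt(b) */
} rsqrt_result;

/* Plain-arithmetic model of the sequence. x0 stands in for the PFRSQRT
 * result (assumption: accurate to about 15 bits, supplied by the caller). */
rsqrt_result refine_rsqrt(float b, float x0)
{
    float x1 = x0 * x0;                /* X1 = PFMUL(X0, X0)                 */
    float x2 = 0.5f * (3.0f - b * x1); /* X2: modeled role of PFRSQIT1(b,X1) */
    float x3 = x2 * x0;                /* X3: modeled role of PFRCPIT2(X2,X0) */
    float x4 = b * x3;                 /* X4 = PFMUL(b, X3)                  */
    rsqrt_result r = { x3, x4 };
    return r;
}
```

For example, with b = 4 and an initial estimate of 0.5 · (1 + 2^-15), the model returns values very close to 0.5 and 2. Multiplying by b in the last step works because b · (1/sqrt(b)) = sqrt(b).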

Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel

The MMX PMADDWD instruction can be used to perform two
signed 16×16→32-bit multiplies in parallel, with much higher
performance than can be achieved using the IMUL instruction.
The PMADDWD instruction is designed to perform four
16×16→32-bit signed multiplies and accumulate the results
pairwise. If one product in each pair is forced to zero, each
32-bit result holds a single product, leaving just two
multiplies. The following example shows how to
multiply 16-bit signed numbers a,b,c,d into signed 32-bit
products a×c and b×d:
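
The manual's own listing is not reproduced here. As a stand-in, the following portable C sketch models what PMADDWD computes on one pair of 64-bit operands and shows the zero-padding idea. It is not MMX code; the names `pmaddwd_model` and `mul2_16x16` and the placement of a, b, c, d in the even words are assumptions made for illustration.

```c
#include <stdint.h>

/* Portable model of PMADDWD on one 64-bit operand pair: four signed
 * 16x16->32 products, summed pairwise into two 32-bit lanes.
 * Index 0 is the least significant word. */
void pmaddwd_model(const int16_t x[4], const int16_t y[4], int32_t out[2])
{
    for (int lane = 0; lane < 2; ++lane) {
        int32_t p0 = (int32_t)x[2 * lane]     * y[2 * lane];
        int32_t p1 = (int32_t)x[2 * lane + 1] * y[2 * lane + 1];
        /* Add as unsigned so the single overflowing case (all four
         * words equal to -32768) wraps instead of being undefined. */
        out[lane] = (int32_t)((uint32_t)p0 + (uint32_t)p1);
    }
}

/* Two independent products a*c and b*d: the odd word of each pair is
 * zero, so the second product in each pair contributes nothing. */
void mul2_16x16(int16_t a, int16_t b, int16_t c, int16_t d, int32_t out[2])
{
    const int16_t x[4] = { a, 0, b, 0 };
    const int16_t y[4] = { c, 0, d, 0 };
    pmaddwd_model(x, y, out);
}
```

Zeroing the odd words of just one operand would be enough, since a zero factor makes the whole product zero; both are zeroed here only for symmetry.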
