AMD Athlon Processor x86 Optimization Manual page 85

X86 code optimization
Table of Contents

Advertisement

22007E/0—November 1999
Unrolling Loops
Without Loop Unrolling:
MOV ECX, MAX_LENGTH
MOV EAX, OFFSET A
MOV EBX, OFFSET B
$add_loop:
FLD
QWORD PTR [EAX]
FADD
QWORD PTR [EBX]
FSTP
QWORD PTR [EAX]
ADD
EAX, 8
ADD
EBX, 8
DEC
ECX
JNZ
$add_loop
The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so it
cannot execute faster than three iterations in seven cycles, or
3/7 floating-point adds per cycle. However, the pipelined
floating-point adder allows one add every cycle. In the following
code, the loop is partially unrolled by a factor of two, which
creates potential endcases that must be handled outside the
loop:
With Partial Loop Unrolling:
MOV
ECX, MAX_LENGTH
MOV
EAX, offset A
MOV
EBX, offset B
SHR
ECX, 1
JNC
$add_loop
FLD
QWORD PTR [EAX]
FADD
QWORD PTR [EBX]
FSTP
QWORD PTR [EAX]
ADD
EAX, 8
ADD
EBX, 8
$add_loop:
FLD
QWORD PTR[EAX]
FADD
QWORD PTR[EBX]
FSTP
QWORD PTR[EAX]
FLD
QWORD PTR[EAX+8]
FADD
QWORD PTR[EBX+8]
FSTP
QWORD PTR[EAX+8]
ADD
EAX, 16
ADD
EBX, 16
DEC
ECX
JNZ
$add_loop
Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes
AMD Athlon™ Processor x86 Code Optimization
69

Advertisement

Table of Contents
loading

Table of Contents