22007E/0—November 1999
Scheduling Optimizations
This chapter describes how to code instructions for efficient
scheduling. Guidelines are listed in order of importance.
Schedule Instructions According to their Latency
The AMD Athlon™ processor can execute up to three x86
instructions per cycle, with each x86 instruction possibly having
a different latency. The AMD Athlon processor has flexible
scheduling, but for absolute maximum performance, schedule
instructions, especially FPU and 3DNow!™ instructions,
according to their latency. Dependent instructions will then not
have to wait on instructions with longer latencies.
See Appendix F, "Instruction Dispatch and Execution
Resources" on page 187 for a list of latency numbers.
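The effect of latency on dependent instructions can be sketched in C. This is a minimal illustration, not code from this guide: the function names sum_serial and sum_interleaved are hypothetical, and the choice of four accumulators is an assumption made for the example, not a latency figure taken from Appendix F. In the first function every add depends on the result of the previous add, so each one waits out the full add latency. In the second, four independent dependency chains are interleaved, so an add from one chain can issue while the others are still in flight.

```c
#include <stddef.h>

/* Serial dependency chain: each add needs the result of the
   previous add, so the adds cannot overlap. */
float sum_serial(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains (the count of four is an assumption for
   illustration). Consecutive adds do not depend on each other, so
   none has to wait on the one issued just before it. */
float sum_interleaved(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Handle the 0 to 3 elements left over. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums once, outside the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Two cautions apply. The interleaved version adds the elements in a different order, so its result can differ from the serial one in the last bits for general floating-point data. A compiler is also free to reschedule either loop, so the generated code should be inspected before any benefit is assumed.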
Unrolling Loops
Complete Loop Unrolling
Make use of the AMD Athlon processor's large 64-Kbyte
instruction cache and unroll loops to get more parallelism and
reduce loop overhead, even with branch prediction. Complete
AMD Athlon™ Processor x86 Code Optimization