AMD Athlon Processor x86 Code Optimization Guide...
Trademarks AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
Recommendations for AMD-K6 Family and AMD Athlon Processor Blended Code .... 41
Cache and Memory Optimizations
Memory Size and Alignment Issues .... 45
Avoid Memory Size Mismatches
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Scheduling Optimizations Schedule Instructions According to their Latency ....67 Unrolling Loops......... . . 67 Complete Loop Unrolling .
Signed Derivation for Algorithm, Multiplier, and Shift Factor .... 95
Floating-Point Optimizations
Ensure All FPU Data is Aligned
Revision History
Date   Description
Added “About this Document” on page 1.
Further clarification of “Consider the Sign of Integer Operands” on page 14.
Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
Introduction
The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new level. The AMD Athlon processor has been designed to efficiently execute code written for previous-generation x86 processors.
Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors.
Appendix A: AMD Athlon Processor Microarchitecture. Describes in detail the microarchitecture of the AMD Athlon processor.
Appendix C: Implementation of Write Combining. Describes the algorithm used by the AMD Athlon processor to write combine.
Appendix D: Performance Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.
To reduce on-chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls, the AMD Athlon processor has a dedicated high-speed backside L2 cache. The large 128-Kbyte L1 on-chip cache and the backside L2 cache allow the...
As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the heart of the AMD Athlon processor. With the inclusion of all these features, the AMD Athlon processor is capable of decoding, issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.
The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6®, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance.
Group II Optimizations
Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:...
3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All the prefetch...
anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:
Prefetch Length = 200 (DS/C) bytes
Round up to the nearest cache line. DS is the data stride per loop iteration. C is the number of cycles one loop iteration takes when executing entirely from the L1 cache.
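The prefetch-distance computation can be sketched in C as follows. This is an illustration only: it reads the formula as 200 × DS / C bytes rounded up to a 64-byte cache line, and the function name, the parameter names, and the interpretation of C as cycles per loop iteration are assumptions made for this sketch.

```c
#include <stddef.h>

#define CACHE_LINE 64u   /* AMD Athlon cache line size in bytes */

/* Hypothetical helper (not part of the guide).
 * ds     : data stride in bytes per loop iteration
 * cycles : cycles per loop iteration (assumed meaning of C), must be > 0
 * Returns the prefetch distance in bytes, rounded up to a full cache line. */
size_t prefetch_distance(size_t ds, size_t cycles)
{
    size_t bytes = (200u * ds + cycles - 1u) / cycles;            /* ceil(200*ds/cycles) */
    return (bytes + CACHE_LINE - 1u) / CACHE_LINE * CACHE_LINE;   /* round up to a line  */
}
```

For a loop that consumes 8 bytes per iteration in 10 cycles, the raw value is 160 bytes, which rounds up to three cache lines.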
BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache line aligned write buffer.
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not share the same 64-byte cache line, especially if the data ever becomes modified.
Group II Optimizations: Secondary Optimizations...
C Source Level Optimizations
This chapter details C programming practices for optimizing code for the AMD Athlon™ processor. Guidelines are listed in order of importance.
Ensure Floating-Point Variables and Expressions are of Type Float
For compilers that generate 3DNow!™...
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required so an unsigned type is appropriate.
Example (Avoid):
    int i;           ====>   MOV EAX, i
    i = i / 4;               CDQ
                             AND EDX, 3
                             ADD EAX, EDX
                             SAR EAX, 2
                             MOV i, EAX
Example (Preferred):
    unsigned int i;  ====>   SHR i, 2
    i = i / 4;...
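The same contrast can be written as two small C functions. This is a minimal sketch with hypothetical function names; the exact instructions a given compiler emits will vary.

```c
/* Signed division rounds toward zero, so a compiler must add a fix-up
 * for negative values before it can use an arithmetic shift. */
int quarter_signed(int i)
{
    return i / 4;
}

/* Unsigned division by a power of two maps to a single logical shift. */
unsigned int quarter_unsigned(unsigned int i)
{
    return i / 4;
}
```

When the value can never be negative, declaring it unsigned gives the compiler the information it needs to pick the cheaper form.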
Note that source code transformations will interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source code transformations for improving performance and compiler optimizations "fight"...
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops.
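A minimal sketch of complete unrolling for a three-element loop follows. The function names and the scaling operation are chosen for illustration and do not come from the guide.

```c
/* Rolled form: each of the three iterations pays for the increment,
 * the compare, and the branch in addition to the multiply. */
void scale3_loop(float v[3], float s)
{
    for (int i = 0; i < 3; i++)
        v[i] *= s;
}

/* Completely unrolled form: the loop control is gone and the three
 * independent multiplies can be scheduled freely. */
void scale3_unrolled(float v[3], float s)
{
    v[0] *= s;
    v[1] *= s;
    v[2] *= s;
}
```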
code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. It is therefore recommended that the programmer remove the dependency manually, e.g., by...
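One common way to remove such a dependency by hand is to carry the value in a local temporary so that it can stay in a register. The sketch below assumes this technique; the function names and the VECLEN value are hypothetical.

```c
#define VECLEN 8   /* illustrative vector length */

/* Avoid: each iteration loads x[k-1], which the previous iteration
 * has only just stored. */
void prefix_sum_dependent(double x[VECLEN], const double y[VECLEN])
{
    for (int k = 1; k < VECLEN; k++)
        x[k] = x[k - 1] + y[k];
}

/* Preferred: the running value lives in t, so no load has to wait
 * for the preceding store. */
void prefix_sum_temp(double x[VECLEN], const double y[VECLEN])
{
    double t = x[0];
    for (int k = 1; k < VECLEN; k++) {
        t = t + y[k];
        x[k] = t;
    }
}
```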
Consider Expression Order in Compound Branch Conditions
Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && and ||. C guarantees a short-circuit evaluation of these operators.
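The effect of short-circuit evaluation on operand order can be sketched as follows. The predicate, the call counter, and the assumption about which operand is usually false or true are all hypothetical and serve only to make the evaluation order visible.

```c
#include <stdbool.h>

int costly_calls = 0;   /* counts how often the expensive test runs */

/* Stand-in for an expensive test without other side effects. */
bool costly(int x)
{
    costly_calls++;
    return x % 7 == 0;
}

/* With &&, evaluation stops at the first false operand, so the operand
 * assumed most likely to be false is placed first. */
bool both(int x)
{
    return x > 0 && costly(x);
}

/* With ||, evaluation stops at the first true operand, so the operand
 * assumed most likely to be true is placed first. */
bool either(int x)
{
    return x > 0 || costly(x);
}
```

Reordering is only safe when the operands have no side effects that the program depends on.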
Switch Statement Usage
Optimize Switch Statements
Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrence, with the most probable first.
Use Const Type Qualifier
Use the “const” type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available to the compiler.
Generalization for Multiple Constant Control Code
To generalize this further for multiple constant control code, some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this to a simple switch statement.
    case combine( 1, 1 ):
        for( i ... ) {
            DoWork1( i );
            DoWork3( i );
        }
        break;
    default:
        break;
}
The trick here is that there is some up-front work involved in generating all the combinations for the switch constant and the total amount of code has doubled.
which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.
Example 1 (Avoid):
    double a[100],sum;
    int i;
    sum = 0.0f;...
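The idea of exposing parallelism in a floating-point reduction can be sketched with several independent partial sums. This is an illustration under the assumption that four accumulators are used; the function names are hypothetical.

```c
/* One accumulator: every add depends on the result of the previous add. */
double sum_serial(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four accumulators: the four adds in each iteration are independent and
 * can overlap in a pipelined adder. The additions are reassociated, so
 * the result may differ from sum_serial in the least significant bits. */
double sum_4way(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)        /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```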
Example 1
Avoid:
    double a,b,c,d,e,f;
    e = b*c/d;
    f = b/d*a;
Preferred:
    double a,b,c,d,e,f,t;
    t = b/d;
    e = c*t;
    f = a*t;
Example 2
Avoid:
    double a,b,c,e,f;
    e = a/c;
    f = b/c;...
Pad by Multiple of Largest Base Type
Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is...
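The interaction between member order and padding can be sketched with two hypothetical structures. The member names are illustrative, and the exact sizes and offsets depend on the compiler and ABI, so none are stated here.

```c
#include <stddef.h>

/* Members in arbitrary order: the compiler typically inserts padding
 * before x and before k to keep them naturally aligned. */
struct unsorted {
    char   a;
    double x;
    char   b;
    int    k;
};

/* Members sorted from largest to smallest base type, then padded
 * explicitly to a multiple of the largest base type (double). */
struct sorted {
    double x;
    int    k;
    char   a;
    char   b;
    char   pad[2];   /* explicit tail padding */
};
```

With the sorted layout, every element of an array of such structures starts on a boundary suitable for its largest member.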
The x87 FPU has a precision-control field as part of the FPU control word. The precision-control setting determines what precision results get rounded to. It affects the basic arithmetic operations, including divides and square roots. AMD Athlon and AMD-K6®...
necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations. The Microsoft® Visual C environment provides functions to manipulate the FPU control word and thus the precision control.
Avoid Unnecessary Integer Division
Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One opportunity for reducing the number of integer divisions arises with multiple divisions, in which a division can be replaced with a multiplication, as shown in the following examples.
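A minimal sketch of replacing a chain of two divisions with one multiplication and one division is shown below. The function names are hypothetical, and the rewrite assumes unsigned operands, nonzero divisors, and a product j * k that does not overflow.

```c
/* Two integer divisions. */
unsigned int div_twice(unsigned int i, unsigned int j, unsigned int k)
{
    return i / j / k;
}

/* One multiplication and one division. Valid only while j * k fits in
 * an unsigned int; both divisors must be nonzero. */
unsigned int div_once(unsigned int i, unsigned int j, unsigned int k)
{
    return i / (j * k);
}
```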
Example 1 (Avoid):
//assumes pointers are different and q!=r
void isqrt ( unsigned long a, unsigned long *q, unsigned long *r)
{
    *q = a;
    if (a > 0)
    {
        while (*q > (*r = a / *q))
        {
            *q = (*q + *r) >>...
Instruction Decoding Optimizations
This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor. Guidelines are listed in order of importance.
Overview
The AMD Athlon processor instruction fetcher reads 16-byte aligned code windows from the instruction cache. The instruction bytes are then merged into a 24-byte instruction queue.
DirectPath instructions in the AMD Athlon processor. Assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increase code density.
Note: This optimization applies only to floating-point instructions with floating-point operands and not with integer operands, as described in the next optimization.
;uses 1-byte opcode,
; 8-bit immediate
Avoid Partial Register Reads and Writes
In order to handle partial register writes, the AMD Athlon processor execution core implements a data-merging scheme. In the execution unit, an instruction writing a partial register merges the modified portion with the current state of the remainder of the register.
LEA REG1, [REG1*8 + REG2]
Use 8-Bit Sign-Extended Immediates
Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD Athlon processor. For example, ADD BX, –5 should be encoded “83 C3 FB” and not “81 C3 FB FF”.
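Whether an immediate qualifies for the short form is a simple range check, sketched below. The helper name is hypothetical; it only illustrates the condition an assembler or code generator would test.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the immediate can be encoded as an 8-bit value that the
 * processor sign-extends to the operand size. */
bool fits_simm8(int32_t imm)
{
    return imm >= -128 && imm <= 127;
}
```

The value –5 from the example above passes the check, which is why the one-byte immediate FB suffices.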
Use 8-Bit Sign-Extended Displacements
Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD Athlon processor.
Code Padding Using Neutral Code Fillers
Occasionally a need arises to insert neutral code fillers into the code stream, e.g., for code alignment purposes or to space out...
For code that is optimized specifically for the AMD Athlon processor, the optimal code fillers are NOP instructions (opcode 0x90) with up to two REP prefixes (0xF3). In the AMD Athlon processor, a NOP with up to two REP prefixes can be handled by a single decoder with no overhead.
The following assembly language macros show the recommended neutral code fillers for code optimized for the AMD Athlon processor that also has to run well on other x86 processors. Note that for some padding lengths, versions using ESP or EBP are missing due to the lack of fully generalized addressing modes.
Cache and Memory Optimizations
This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor. Guidelines are listed in order of importance.
Memory Size and Alignment Issues
Avoid Memory Size Mismatches
Avoid memory size mismatches when instructions operate on the same data.
3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor. The PREFETCH and PREFETCHW instructions take advantage of the AMD Athlon processor’s high bus bandwidth...
On the AMD-K6-2 and AMD-K6-III processors, PREFETCHW works the same as PREFETCH. On the AMD Athlon processor, PREFETCHW gives a hint of an intent to modify the cache line. The AMD Athlon processor will mark the cache line being brought in by PREFETCHW as Modified. Using...
    MOV ECX, (-LARGE_NUM)      ;used biased index
    MOV EAX, OFFSET array_a    ;get address of array_a
    MOV EDX, OFFSET array_b    ;get address of array_b
    MOV ECX, OFFSET array_c    ;get address of array_c
$loop:
    PREFETCHW [EAX+196]        ;two cachelines ahead...
Determining Prefetch Distance
Given the latency of a typical AMD Athlon processor system and expected processor speeds, the following formula should be...
-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggressive write-combining algorithm, which improves performance significantly.
Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store architecture when either a load operation is not allowed to read needed data from a store in the store buffer, or a load OP detects a false data dependency on a store in the store buffer.
Narrow-to-Wide
If the following conditions are present, there is a...
Example 5 (Preferred):
    MOVD      [foo], MM1      ;store lower half
    PUNPCKHDQ MM1, MM1        ;get upper half into lower half
    MOVD      [foo+4], MM1    ;store lower half
    MOV       EAX, [foo]      ;fine
    MOV       EDX, [foo+4]    ;fine
Misaligned
If the following condition is present, there is a misaligned...
One Supported Store-to-Load Forwarding Case
There is one case of a mismatched store-to-load forwarding that is supported by the AMD Athlon processor. The lower 32 bits from an aligned QWORD write feeding into a DWORD read is allowed.
Example (Preferred):
Prolog:
    PUSH EBP
    MOV  EBP, ESP
    SUB  ESP, SIZE_OF_LOCALS   ;size of local variables
    AND  ESP, -8
    ;push registers that need to be preserved
Epilog:
    ;pop register that needed to be preserved
    MOV  ESP, EBP
    POP  EBP
With this technique, function arguments can be accessed via EBP, and local variables can be accessed via ESP.
Example:
    struct {
        char a[5];
        long k;
        double x;
    } baz;
The structure components should be allocated (lowest to highest address) as follows:
x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0
See “C Language Structure Component Considerations” on page 27 for more information from a C source code perspective.
2 illustrate this concept using the CMOV instruction. Note that the AMD-K6® processor does not support the CMOV instruction. Therefore, blended AMD-K6 and AMD Athlon processor code should use examples 3 and 4.
Avoid Branches Dependent on Random Data...
Replace Branches with Computation in 3DNow!™ Code
Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar and inhibit the SIMD processing that makes 3DNow! code superior.
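The underlying idea, replacing a branch with a mask computation, can be sketched in scalar C. This is an illustration only: the function name is hypothetical, and the code mirrors what a compare-and-mask sequence computes per element rather than reproducing any 3DNow! or MMX instruction sequence.

```c
#include <stdint.h>

/* Computes r = (x < y) ? a : b without a conditional branch.
 * mask is all ones when x < y and all zeros otherwise. */
uint32_t select_lt(int32_t x, int32_t y, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < y);
    return (a & mask) | (b & ~mask);
}
```

Because the same AND/OR steps apply to every element, a SIMD version can process all elements of a packed register at once.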
Example 2 (Preferred):
; r = (x < y) ? a : b
; in:
; out: mm1
    PCMPGTD MM3, MM2   ; y > x ? 0xffffffff : 0
    PAND    MM1, MM3   ;...
Example 2:
C code:
    float x,z;
    z = abs(x);
    if (z >= 1) {
        z = 1/z;
    }
3DNow! code:
    ;in:  MM0 = x
    ;out: MM0 = z
    MOVQ MM5, mabs   ;0x7fffffff
    PAND MM0, MM5    ;z=abs(x)
Example 4:
C code:
    #define PI 3.14159265358979323
    float x,z,r,res;
    /* 0 <= r <= PI/4 */
    z = abs(x);
    if (z < 1) {
        res = r;
    }
    else {
        res = PI/2-r;...
Example 5:
C code:
    #define PI 3.14159265358979323
    float x,y,xa,ya,r,res;
    int xs,df;
    xs = x < 0 ? 1 : 0;
    xa = fabs(x);
    ya = fabs(y);
    df = (xa < ya);
    if (xs && df) {
        res = PI/2 + r;...
Avoid the Loop Instruction
The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:
Example 1 (Avoid):
    LOOP LABEL
Example 2 (Preferred):
    DEC ECX
    JNZ LABEL
Avoid Far Control Transfer Instructions
Avoid using far control transfer instructions.
Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. A function is end-recursive when its call to itself is at the end of the code.
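Converting an end-recursive function into a loop can be sketched with a greatest-common-divisor routine. The choice of function is an illustration and does not come from the guide.

```c
/* End-recursive: the call to itself is the last action performed. */
unsigned long gcd_recursive(unsigned long a, unsigned long b)
{
    return b == 0 ? a : gcd_recursive(b, a % b);
}

/* Iterative equivalent: the recursive call becomes an update of the
 * arguments followed by another pass through the loop, so no return
 * addresses accumulate. */
unsigned long gcd_iterative(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```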
Guidelines are listed in order of importance.
Schedule Instructions According to their Latency
The AMD Athlon™ processor can execute up to three x86 instructions per cycle, with each x86 instruction possibly having a different latency. The AMD Athlon processor has flexible scheduling, but for absolute maximum performance, schedule instructions, especially FPU and 3DNow!™...
unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.
    ADD EAX, 8
    ADD EBX, 8
    DEC ECX
    JNZ $add_loop
The loop consists of seven instructions. The AMD Athlon processor can decode/retire three instructions per cycle, so it cannot execute faster than three iterations in seven cycles, or 3/7 floating-point adds per cycle. However, the pipelined floating-point adder allows one add every cycle.
no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original...
AMD Athlon processor is less susceptible than other processors to the negative side effect of function inlining. Function call overhead on the AMD Athlon processor can be low because calls and returns are executed at high speed due to the use of prediction mechanisms.
Avoid Address Generation Interlocks Loads and stores are scheduled by the AMD Athlon processor to access the data cache in program order. Newer loads and stores with their addresses calculated can be blocked by older loads and stores whose addresses are not yet calculated –...
Example 1 (Avoid):
    int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
    for (i=0; i < MAXSIZE; i++) {
        c [i] = a[i] + b[i];
    }
    MOV ECX, MAXSIZE   ;initialize loop counter
    XOR ESI, ESI       ;initialize offset into array a
    XOR EDI, EDI       ;initialize offset into array b...
variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.g., when the base addresses are passed as...
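The negative-index technique can be sketched in C as follows. The function and variable names are hypothetical: the pointers are biased to one element past the end of each array, and a single index runs from -n up to zero, so the loop increment also serves as the termination test.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i] for i in [0, n), written with one biased index. */
void add_arrays(int *c, const int *a, const int *b, ptrdiff_t n)
{
    const int *a_end = a + n;   /* one past the last element of a */
    const int *b_end = b + n;
    int *c_end = c + n;

    for (ptrdiff_t i = -n; i != 0; i++)
        c_end[i] = a_end[i] + b_end[i];
}
```

Only one register needs updating per iteration, instead of a counter plus a separate offset for each array.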
Push Memory Data Carefully...
Replace Divides with Multiplies
Replace integer division by constants with multiplication by the reciprocal. Because the AMD Athlon™ processor has a very fast integer multiply (5–9 cycles signed, 4–8 cycles unsigned) and the integer division delivers only one bit of quotient per cycle (22–47 cycles signed, 17–41 cycles unsigned), the...
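As an illustration of division by a constant through multiplication by a reciprocal, the sketch below divides a 32-bit unsigned value by 10. The constant 0xCCCCCCCD is ceil(2^35 / 10), and the function name is hypothetical; other divisors need their own multiplier and shift count.

```c
#include <stdint.h>

/* x / 10 for any 32-bit unsigned x: multiply by the scaled reciprocal
 * and keep the bits above position 35 of the 64-bit product. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Compilers commonly apply this transformation themselves when the divisor is a compile-time constant.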
Signed Division Utility
In the opt_utilities directory of the AMD documentation CDROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user enters a signed constant divisor.
;algorithm 1
    MOV  EAX, m
    MOV  EDX, dividend
    MOV  ECX, EDX
    IMUL EDX
    ADD  EDX, ECX
    SHR  ECX, 31
    SAR  EDX, s
    ADD  EDX, ECX   ;quotient in EDX
Derivation for a, m, s
The derivation for the algorithm (a), multiplier (m), and shift count (s), is found in the section “Signed Derivation for...
In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.
Repeated String Instruction Usage
Latency of Repeated String Instructions
Table 1 shows the latency for repeated string instructions on the AMD Athlon processor.
Ensure DF=0 (UP)
Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, source and destination overlap).
Use XOR Instruction to Clear Integer Registers
To clear an integer register to all 0s, use “XOR reg, reg”. The AMD Athlon processor is able to avoid the false read dependency on the XOR instruction.
Example 1 (Acceptable):...
Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (count applied modulo 64)
    SHLD EDX, EAX, CL   ;first apply shift count
    SHL  EAX, CL        ; mod 32 to EDX:EAX...
Example 7 (Division):
;_ulldiv divides two unsigned 64-bit integers, and returns the quotient.
;INPUT:    [ESP+8]:[ESP+4]   dividend
;          [ESP+16]:[ESP+12] divisor
;OUTPUT:   EDX:EAX quotient of division
;DESTROYS: EAX,ECX,EDX,EFlags
_ulldiv PROC
    PUSH EBX             ;save EBX as per calling convention
    MOV  ECX, [ESP+20]   ;divisor_hi...
Efficient Implementation of Population Count Function
Population count is an operation that determines the number of set bits in a bit string. For example, this can be used to determine the cardinality of a set. The following example code shows how to efficiently implement a population count operation for 32-bit operands.
Step 3
For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed:
y = (x + (x >>...
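A complete C sketch of a divide-and-conquer population count for 32-bit operands is shown below. It follows the widely used form of the algorithm; the variable names w, x, and y and the final multiply step are assumptions of this sketch, since the surrounding steps are not fully reproduced here.

```c
#include <stdint.h>

/* Counts the set bits in a 32-bit value. */
uint32_t popcount32(uint32_t v)
{
    /* Each 2-bit field holds the count of its two original bits. */
    uint32_t w = v - ((v >> 1) & 0x55555555u);
    /* Each 4-bit field holds the sum of two adjacent 2-bit counts. */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    /* Each 8-bit field holds the sum of two adjacent 4-bit counts;
     * the sum fits, so one mask after the add is enough. */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;
    /* Add the four byte counts; the total lands in the top byte. */
    return (y * 0x01010101u) >> 24;
}
```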
/* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: “Integer Multiplication and Division on the HP Precision Architecture”. IEEE Transactions on Computers, Vol 37, No. 8, August 1988, page 980. */
else {
    s = log2(d);...
;algorithm 1
    MOV  EAX, m
    MOV  EDX, dividend
    MOV  ECX, EDX
    IMUL EDX
    ADD  EDX, ECX
    SHR  ECX, 31
    SAR  EDX, s
    ADD  EDX, ECX   ; quotient in EDX
typedef unsigned __int64 U64;
typedef unsigned long U32;
U32 log2 (U32 i)
{
    U32 t = 0;...
Floating-Point Optimizations
This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU).
;removes one register from stack
FCOMPP ;removes two registers from stack
On the AMD Athlon processor, a faster alternative is to use the FFREEP instruction below. Note that the FFREEP instruction, although insufficiently documented in the past, is supported by all 32-bit x86 processors.
Although the AMD Athlon processor FPU has a deep scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency chains can stall the scheduler while issue slots are still available.
FLDCW [SAVE_CW] ;restore original control word
The AMD Athlon processor contains special acceleration hardware to execute such code as quickly as possible. In most situations, the above code is therefore the fastest way to perform floating-point-to-integer conversion and the conversion is compliant both with programming language standards and the IEEE-754 standard.
FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it. For applications in which the...
Example 3 (Potentially faster):
    MOV ECX, DWORD PTR[X+4]   ;get upper 32 bits of double
    XOR EDX, EDX              ;i = 0
    MOV EAX, ECX              ;save sign bit
    AND ECX, 07FF00000h       ;isolate exponent field
    CMP ECX, 03FF00000h       ;if abs(x) < 1.0
    JB  $DONE2                ;...
Floating-Point Subexpression Elimination
There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries. In the cases where two instructions share a source operand, an FXCH is not required between the two instructions.
If an “argument out of range” is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an argument > 2^63 is unusual, it often indicates a problem elsewhere in the code and the code may completely fail in the absence of a properly guarded trigonometric instruction.
Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it.
Take Advantage of the FSINCOS Instruction...
3DNow!™ and MMX™ Optimizations
This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be found in the 3DNow!™ Instruction Porting Guide, order# 22621.
Use 3DNow!™ Instructions...
FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction. 3DNow! and MMX instructions are designed to be used concurrently with no switching issues. Likewise, enhanced 3DNow! instructions can be used simultaneously with MMX instructions.
Pipelined Pair of 24-Bit Precision Divides
This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within preceding code.
Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root
3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root.
Optimized 15-Bit Precision Square Root...
= PFMUL(b,X The 24-bit final reciprocal square root value is X . In the AMD Athlon processor 3DNow! implementation, the estimate contains the correct round-to-nearest value for approximately 87% of all arguments. The remaining arguments differ from the correct round-to-nearest value by one unit-in-the-last-place. The...
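The refinement of a reciprocal square root estimate can be illustrated with the standard Newton-Raphson step, sketched in scalar C below. This shows the mathematics only and is an assumption of this sketch; it is not a reproduction of the 3DNow! instruction sequence, and the function name is hypothetical.

```c
/* One Newton-Raphson step toward 1/sqrt(b), starting from estimate x0:
 *     x1 = 0.5 * x0 * (3 - b * x0 * x0)
 * Each step roughly doubles the number of correct bits in the estimate. */
float rsqrt_refine(float b, float x0)
{
    return 0.5f * x0 * (3.0f - b * x0 * x0);
}
```

Starting from a low-precision estimate, one or two such steps are typically enough to reach single-precision accuracy.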
= mmreg2[31:0])
mmreg1[31:0] = mmreg2[63:32])
See the AMD Extensions to the 3DNow! and MMX Instruction Set Manual, order #22466 for more usage information.
Blended Code
Otherwise, for blended code, which needs to run well on...
AMD processors. The first example shows how to do the conversion on a processor that supports AMD’s 3DNow! extensions, such as the AMD Athlon processor. It demonstrates the increased efficiency from using the PI2FW instruction.
PXOR and PMUL instructions are the same in terms of latency. On the AMD-K6 processor, there is only a one cycle latency for PXOR, versus a two cycle latency for the 3DNow! PFMUL instruction.
AMD Athlon processor specific code where the destination is in cacheable memory and immediate data re-use of the data at the destination is expected
AMD-K6 family specific code where the destination is in non-cacheable memory
Example 1:
/* block copy (source and destination QWORD aligned) */...
Microsoft Visual C, is suitable for moving/filling a quadword aligned block of data in the following situations:
AMD Athlon processor specific code where the destination of the block copy is in non-cacheable memory space
AMD Athlon processor specific code where the destination of the block copy is in cacheable space, but no immediate data re-use of the data at the destination is expected.
Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
To set all the bits in an MMX register to one, use:
    PCMPEQD MMreg, MMreg
Note that PCMPEQD MMreg, MMreg is dependent on previous writes to MMreg.
    MOV EBX, [RES]        ;EBX = destination vector ptr
    MOV ECX, [NUMVERTS]   ;ECX = number of vertices to transform
;3DNow! version of fully general 3D vertex transformation.
;Optimal for AMD Athlon (completes in 16 cycles)
    FEMMS                 ;clear MMX state
    ALIGN                 ;for optimal branch alignment...
Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
Clipping is one of the major activities occurring in a 3D graphics pipeline. In many instances, this activity is split into two parts which do not necessarily have to occur consecutively:...
The following code fragment uses the 3DNow! PAVGUSB instruction to perform averaging between the source macroblock and destination macroblock:
Example 2 (Preferred):
    MOV EAX, DWORD PTR Src_MB
    MOV EDI, DWORD PTR Dst_MB...
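For reference, the per-byte operation that PAVGUSB performs, a rounded average of unsigned bytes, can be written in scalar C as below. The function name and the in-place update of the destination buffer are assumptions of this sketch; the packed instruction applies the same arithmetic to eight bytes at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = (dst[i] + src[i] + 1) >> 1 for each unsigned byte, i.e. the
 * average of the two values rounded up on a tie. */
void avg_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((dst[i] + src[i] + 1u) >> 1);
}
```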
Complex Number Arithmetic Complex numbers have a “real” part and an “imaginary” part. Multiplying complex numbers (for example, 3 + 4i) is an integral part of many algorithms such as Discrete Fourier Transform (DFT) and complex FIR filters.
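As a reference point for the arithmetic involved, the product of two complex numbers is (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The scalar C sketch below shows this directly; it is not the 3DNow! implementation, and the type and function names are hypothetical.

```c
/* Scalar reference for complex multiplication.
   (Illustrative names; a SIMD version would compute the four
   partial products in parallel.) */
typedef struct { float re; float im; } complex_f32;

complex_f32 complex_make(float re, float im)
{
    complex_f32 c = { re, im };
    return c;
}

complex_f32 complex_mul(complex_f32 a, complex_f32 b)
{
    complex_f32 r;
    r.re = a.re * b.re - a.im * b.im;  /* ac - bd */
    r.im = a.re * b.im + a.im * b.re;  /* ad + bc */
    return r;
}
```

For instance, (3 + 4i)(1 + 2i) gives a real part of 3 - 8 = -5 and an imaginary part of 6 + 4 = 10. Each product needs four multiplies and two add/subtract operations, which is what makes it a natural fit for packed floating-point instructions.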
AMD-K6 processor, Pentium, and Pentium Pro processors either improve the performance of the AMD Athlon processor or are not required and have a neutral effect (usually due to fewer coding restrictions with the AMD Athlon processor).
Dependencies Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
Appendix A AMD Athlon™ Processor Microarchitecture Introduction When discussing processor design, it is important to understand the following terms: architecture, microarchitecture, and design implementation. The term architecture refers to the instruction set and features of a processor that are visible to software programs running on the processor.
Instead of executing complex x86 instructions, which have lengths from 1 to 15 bytes, the AMD Athlon processor executes the simpler fixed-length OPs, while maintaining the instruction coding efficiencies found in x86 programs. The enhanced microarchitecture used in the...
Figure 1. AMD Athlon™ Processor Block Diagram
Instruction Cache The out-of-order execute engine of the AMD Athlon processor contains a very large 64-Kbyte L1 instruction cache. The L1 instruction cache is organized as a 64-Kbyte, two-way, set-associative array. Each line in the instruction array is 64 bytes long.
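The stated geometry (64 Kbytes, two-way, 64-byte lines) implies 64 Kbytes / (2 ways x 64 bytes) = 512 sets. The C sketch below works through that arithmetic. The set-index function assumes the conventional scheme of indexing with the address bits just above the line offset; the excerpt does not confirm the actual indexing, so treat that part as an assumption.

```c
#include <stdint.h>

/* Cache geometry taken from the text above. */
enum { L1I_SIZE_BYTES = 64 * 1024, L1I_WAYS = 2, L1I_LINE_BYTES = 64 };

/* Number of sets = total size / (ways * line size). */
uint32_t l1i_num_sets(void)
{
    return L1I_SIZE_BYTES / (L1I_WAYS * L1I_LINE_BYTES);
}

/* ASSUMPTION: conventional indexing by the low-order line-address bits. */
uint32_t l1i_set_index(uint32_t addr)
{
    return (addr / L1I_LINE_BYTES) % l1i_num_sets();
}
```

Under this model, two code addresses that are a multiple of 32 Kbytes apart map to the same set, so more than two such hot code blocks would compete for the two ways.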
The AMD Athlon processor employs combinations of a branch target address buffer (BTB), a global history bimodal counter (GHBC) table, and return address stack (RAS) hardware in order to predict and accelerate branches.
return stack. Subsequent RETs pop a predicted return address off the top of the stack. Early Decoding The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs.
Instruction Control Unit The instruction control unit (ICU) is the control center for the AMD Athlon processor. The ICU controls the following resources: the centralized in-flight reorder buffer, the integer scheduler, and the floating-point scheduler. In turn, the ICU is responsible for the following functions:...
Integer Scheduler The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes. The reservation stations are six entries deep, for a total queuing system of 18 integer MacroOPs. Each reservation station divides the MacroOPs into...
Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calculate the logical addresses for loads, stores, and LEAs.
Floating-Point Execution Unit The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructions, all 3DNow! operations, and all MMX operations.
Load-Store Unit (LSU) The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory. The 44-entry LSU provides a data interface for both the integer scheduler and the floating-point scheduler.
155 for detailed information about write combining. AMD Athlon™ System Bus The AMD Athlon system bus is a high-speed bus that consists of a pair of unidirectional 13-bit address and control channels and a bidirectional 64-bit data bus. The AMD Athlon system bus supports low-voltage swing, multiprocessing, clock forwarding, and fast data transfers.
Fetch and Decode Pipeline Stages Figure 5 on page 142 and Figure 6 on page 142 show the AMD Athlon processor instruction fetch and decoding pipeline stages. The pipeline consists of one cycle for instruction fetches and four cycles of instruction alignment and decoding. The...
[Figure: fetch and decode pipeline diagram. Recoverable labels: Entry Point Decode, VectorPath Decode, MROM, Decode.]
Cycle 1–FETCH The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory. Cycle 2–SCAN SCAN determines the start and end pointers of instructions.
operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU. Integer Pipeline Stages The integer execution pipeline consists of four or more stages for scheduling and execution and, if necessary, accessing data in the processor caches or system memory.
Cycle 7–SCHED In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus. When all operands are received, SCHED schedules the MacroOP for execution and issues the OPs to the next stage, EXEC.
Floating-Point Pipeline Stages The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructions, all 3DNow! operations, and all MMX operations.
Cycle 7–STKREN The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags. Cycle 8–REGREN The register renaming (REGREN) pipeline stage in cycle 8 is responsible for register renaming.
Execution Unit Resources Terminology The execution units operate with two types of register values: operands and results. There are three operand types and two result types, which are described in this section.
Integer Pipeline Operations Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type. Table 2. Integer Pipeline Operation Types Category...
Floating-Point Pipeline Operations Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types. Table 4. Floating-Point Pipeline Operation Types...
Load/Store Pipeline Operations The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations. For example, consider the following code sample: AX, [EBX] ;1 load MacroOP PUSH ;1 store MacroOP ;1 load MacroOP...
Code Sample Analysis The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints.
Write Combining Introduction This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family. The AMD Athlon processor supports the memory type and range register (MTRR) and the page attribute table (PAT) extensions, which allow software to define ranges of memory as either writeback (WB), write-protected (WP), writethrough (WT), uncacheable (UC), or write-combining (WC).
The steps required for programming write combining on the AMD Athlon processor are as follows: 1. Verify the presence of an AMD Athlon processor by using the CPUID instruction to check for the instruction family code and vendor identification of the processor. Standard...
signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six. 2. In addition, the presence of the MTRRs is indicated by bit...
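The family-code check in step 1 amounts to extracting bits 11–8 of the signature returned in EAX. The C sketch below shows only that bit extraction; it assumes the caller has already executed CPUID and holds the EAX value, and both function names are hypothetical.

```c
#include <stdint.h>

/* Extract the instruction family code from the processor signature,
   which the text locates in EAX[11:8].
   (Illustrative helper; obtaining EAX via CPUID is left to the caller.) */
uint32_t cpuid_family(uint32_t signature_eax)
{
    return (signature_eax >> 8) & 0xFu;
}

/* The text gives six as the family code for the AMD Athlon processor. */
int is_family_six(uint32_t signature_eax)
{
    return cpuid_family(signature_eax) == 6u;
}
```

A family match alone is not sufficient: the procedure above also calls for checking the vendor identification, since other vendors use family code six as well.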
Table 9. Write Combining Completion Events Event Comment The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect Non-WB write outside of write combining.
Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands. Table 10 lists the rules for determining what system commands are issued for a write buffer, as a function of the alignment of the valid buffer data.
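A simplified way to reason about whether two writes can land in the same 64-byte write buffer is to compare their 64-byte block addresses. The C sketch below models only that one condition, assuming each buffer covers one naturally aligned 64-byte block; the other closing events are not modeled, and the function names are hypothetical.

```c
#include <stdint.h>

/* Base address of the 64-byte block containing addr.
   ASSUMPTION: write buffers are aligned to 64-byte blocks. */
uint32_t wc_block_base(uint32_t addr)
{
    return addr & ~63u;
}

/* Nonzero when both addresses fall in the same 64-byte block, so a
   second write would not close combining on block-address grounds. */
int same_wc_block(uint32_t a, uint32_t b)
{
    return wc_block_base(a) == wc_block_base(b);
}
```

In practice this suggests writing a write-combining region sequentially and filling each 64-byte block completely before moving on, so that each buffer can be sent with as few system bus commands as possible.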
Appendix D Performance-Monitoring Counters This chapter describes how to use the AMD Athlon™ processor performance monitoring counters. Overview The AMD Athlon processor provides four 48-bit performance counters, which allow four types of events to be monitored simultaneously.
These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h. The PerfCtr[3:0] registers are located at MSR locations C001_0004h to C001_0007h and are 64-bit registers.
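The MSR layout can be captured as two small address helpers. This C sketch assumes the four PerfEvtSel registers and the four PerfCtr registers occupy consecutive MSR numbers, with PerfCtr[3] at C001_0007h; the function names are hypothetical, and issuing RDMSR/WRMSR (which requires privileged code) is outside the sketch.

```c
#include <stdint.h>

/* MSR number of PerfEvtSel[n], n = 0..3; returns 0 for an invalid index. */
uint32_t perf_evt_sel_msr(unsigned n)
{
    return (n <= 3u) ? (0xC0010000u + n) : 0u;
}

/* MSR number of PerfCtr[n], n = 0..3; returns 0 for an invalid index. */
uint32_t perf_ctr_msr(unsigned n)
{
    return (n <= 3u) ? (0xC0010004u + n) : 0u;
}
```

Pairing the helpers by index keeps each event selector with the counter it controls, for example PerfEvtSel[2] at C001_0002h with PerfCtr[2] at C001_0006h.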
Unit Mask Field (Bits 8–15) These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states. See Table 11 on page 164 for a list of unit masks and their 8-bit codes.
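Since the unit mask occupies bits 8–15 of a PerfEvtSel value, setting it is a mask-and-shift operation. The C sketch below handles only that field and leaves every other bit untouched; the helper names are hypothetical, and the meaning of any particular mask value comes from Table 11, not from this code.

```c
#include <stdint.h>

/* Replace the unit mask field (bits 15..8) of a PerfEvtSel value,
   preserving all other bits. (Illustrative helper.) */
uint32_t perf_set_unit_mask(uint32_t evt_sel, uint8_t unit_mask)
{
    return (evt_sel & ~0x0000FF00u) | ((uint32_t)unit_mask << 8);
}

/* Read back the unit mask field (bits 15..8). */
uint8_t perf_get_unit_mask(uint32_t evt_sel)
{
    return (uint8_t)((evt_sel >> 8) & 0xFFu);
}
```

Clearing the field before OR-ing in the new value avoids leaving stale qualifier bits behind when a selector is reprogrammed for a different event.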
greater than or equal to the counter mask. Otherwise if this field is zero, then the counter increments by the total number of events. Table 11. Performance-Monitoring Counters Event Source Notes / Unit Mask (bits 15–8)
Table 11. Performance-Monitoring Counters (Continued)
Columns: Event Number, Source Unit, Notes / Unit Mask (bits 15–8), Event Description
Event description: System requests with the selected type
Unit mask: 1xxx_xxxxb = reserved; x1xx_xxxxb = WB; xx1x_xxxxb = WP; xxx1_xxxxb = WT; bits 11–10 = reserved...
Table 11. Performance-Monitoring Counters (Continued) Event Source Notes / Unit Mask (bits 15–8) Event Description Number Unit Cycles that at least one fill request waited to use the L2 Instruction cache fetches...
Table 11. Performance-Monitoring Counters (Continued) Event Source Notes / Unit Mask (bits 15–8) Event Description Number Unit ICU full Reservation stations full FPU full LS full All quiet stall Far transfer or resync branch pending...
allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are useful for generating an interrupt after a specific number of events.
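The use of negative initial values can be illustrated by treating a counter as a 48-bit value that wraps to zero on overflow. In the C sketch below, a counter preloaded with -N reaches zero after exactly N events. This is an arithmetic model only, assuming 48-bit wraparound; the helper names are hypothetical and no MSR access is shown.

```c
#include <stdint.h>

#define PERF_CTR_MASK ((1ULL << 48) - 1ULL)  /* 48-bit counter width */

/* 48-bit two's-complement encoding of -events: the value to preload so
   that the counter overflows after `events` increments. */
uint64_t perf_ctr_init_for(uint64_t events)
{
    return (0ULL - events) & PERF_CTR_MASK;
}

/* Model of the counter advancing by `events`, wrapping at 48 bits. */
uint64_t perf_ctr_advance(uint64_t ctr, uint64_t events)
{
    return (ctr + events) & PERF_CTR_MASK;
}
```

For example, preloading for 1000 events yields a counter that is one short of wrapping after 999 events and reads zero after the 1000th, which is the point at which an overflow interrupt would be signaled if enabled.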
RDTSC and RDPMC instructions, which allow application code to read the counters directly. Monitoring Counter Overflow The AMD Athlon processor provides the option of generating a debug interrupt when a performance-monitoring counter overflows. This mechanism is enabled by setting the interrupt enable flag in one of the PerfEvtSel[3:0] MSRs.
An event monitor application utility or another application program can read the collected performance information of the profiled application. Monitoring Counter Overflow...
Appendix E Programming the MTRR and PAT Introduction The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includes the Page Attribute Table (PAT) for defining attributes of pages. This chapter documents the use and capabilities of this feature.
There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4K, 16K or 64K segment within the first 1 Mbyte of memory, there is one fixed address MTRR.
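The fixed-range segments can be sketched as a lookup from physical address to segment size. The boundaries below follow the conventional x86 fixed-range MTRR layout (64-Kbyte segments up to 7FFFFh, 16-Kbyte segments up to BFFFFh, 4-Kbyte segments up to FFFFFh); the excerpt itself does not list them, so they are an assumption, as is the function name.

```c
#include <stdint.h>

/* Size in bytes of the fixed-range MTRR segment covering phys_addr,
   or 0 if the address is at or above 1 Mbyte.
   ASSUMPTION: conventional x86 fixed-range layout. */
uint32_t fixed_mtrr_segment_size(uint32_t phys_addr)
{
    if (phys_addr < 0x80000u)  return 64u * 1024u;  /* 00000h-7FFFFh */
    if (phys_addr < 0xC0000u)  return 16u * 1024u;  /* 80000h-BFFFFh */
    if (phys_addr < 0x100000u) return 4u * 1024u;   /* C0000h-FFFFFh */
    return 0u;                                      /* variable ranges apply */
}
```

The finer granularity near the top of the first megabyte matches where legacy adapter ROMs and video memory typically live, which is where mixed memory types are most likely to be needed.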
Memory Types Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in Table 12 on page 174.
MTRR Default Type Register Format. The MTRR default type register is defined as follows. Type Reserved Symbol Description Bits MTRRs Enabled Fixed Range Enabled Type Default Memory Type 7–0 Figure 14. MTRR Default Type Register Format MTRRs are enabled when set.
When a large page (2 Mbytes/4 Mbytes) mapping covers a region that contains more than one memory type (as mapped by the MTRRs), the AMD Athlon processor does not suppress the caching of that large page mapping and only caches the mapping for just that 4-Kbyte piece in the 4-Kbyte TLB.
not affected by this issue, only the variable range (and MTRR DefType) registers are affected. Page Attribute Table (PAT) The Page Attribute Table (PAT) is an extension of the page table entry format, which allows the specification of memory types to regions of physical memory based on the linear address.
Accessing the PAT A 3-bit index consisting of the PATi, PCD, and PWT bits of the page table entry is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is defined as bit 7 for 4-Kbyte PTEs and bit 12 for PDEs which map to 2-Mbyte or 4-Mbyte pages).
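The index formation can be sketched in C as follows. The PATi bit positions come from the text; the positions of PCD (bit 4) and PWT (bit 3) are the standard x86 page-table-entry layout and are stated here as an assumption since the excerpt does not give them. The function names are hypothetical.

```c
#include <stdint.h>

/* Combine PATi, PCD, and PWT into the 3-bit PAT index (PATi is the MSB).
   ASSUMPTION: PCD is bit 4 and PWT is bit 3 of the entry. */
static unsigned pat_index_from_bits(uint32_t entry, unsigned pati_bit)
{
    unsigned pati = (entry >> pati_bit) & 1u;
    unsigned pcd  = (entry >> 4) & 1u;
    unsigned pwt  = (entry >> 3) & 1u;
    return (pati << 2) | (pcd << 1) | pwt;
}

/* PAT index for a 4-Kbyte PTE (PATi is bit 7). */
unsigned pat_index_4k(uint32_t pte)
{
    return pat_index_from_bits(pte, 7u);
}

/* PAT index for a PDE mapping a 2-Mbyte or 4-Mbyte page (PATi is bit 12). */
unsigned pat_index_large(uint32_t pde)
{
    return pat_index_from_bits(pde, 12u);
}
```

Three bits give index values 0 through 7, so software that leaves PATi clear can only reach the first four PAT fields.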
Table 15. Effective Memory Type Based on PAT and MTRRs PAT Memory Type MTRR Memory Type Effective Memory Type WB, WT, WP, WC UC-Page UC-MTRR WB, WT WB, WP UC-MTRR WC, WT Notes: 1.
Table 16. Final Output Memory Types Input Memory Type Output Memory Type AMD-751 Note 1, 2 Page Attribute Table (PAT)
Table 16. Final Output Memory Types (Continued) Input Memory Type Output Memory Type AMD-751 Note Notes: 1. WP is not functional for RdMem/WrMem. 2. ForceCD must cause the MTRR memory type to be ignored in order to avoid x’s.
MTRR Fixed-Range Register Format The memory types for the memory segments covered by each of the MTRR fixed-range registers are defined in Table 17 (also see “Standard MTRR Types and Properties” on page 176).
Register Format The variable address range is power-of-2 sized and aligned. The range of supported sizes is from 2^12 to 2^36 in powers of 2. The AMD Athlon processor does not implement A[35:32]. Type Physical Base Reserved Symbol...
Physical Mask Reserved
Symbol, Description, Bits:
Physical Mask: 24-Bit Mask, bits 35–12
V: Variable Range Register Pair Enabled, bit 11 (V = 0 at reset)
Figure 17. MTRRphysMaskn Register Format
Note: A software attempt to write to reserved bits will generate a general protection exception.
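How a base/mask pair describes a power-of-2 range can be sketched with two C helpers. The mask field spans bits 35–12 as shown in the register format; the match rule used here, (address AND mask) equals (base AND mask), is the conventional x86 variable-range MTRR rule and is an assumption with respect to this excerpt. The function names are hypothetical, and the valid bit and memory type are not modeled.

```c
#include <stdint.h>

#define MTRR_MASK_FIELD 0xFFFFFF000ULL  /* bits 35..12 */

/* Physical mask for a power-of-2 range of `size` bytes (size >= 4 Kbytes).
   Every address bit above the range size must match the base. */
uint64_t mtrr_phys_mask_for_size(uint64_t size)
{
    return ~(size - 1ULL) & MTRR_MASK_FIELD;
}

/* ASSUMPTION: conventional match rule for variable-range MTRRs. */
int mtrr_range_matches(uint64_t addr, uint64_t base, uint64_t mask)
{
    return (addr & mask) == (base & mask);
}
```

A 1-Mbyte range therefore uses the mask FFFF00000h, and the base must itself be 1-Mbyte aligned for the pair to describe a single contiguous region.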
MTRR MSR Format This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits. Table 18. MTRR-Related Model-Specific Register (MSR) Map...
Appendix F Instruction Dispatch and Execution Resources This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations. Tables 19 through 24 starting on page 188 define the integer, MMX™, MMX...
DirectPath or VectorPath (see “DirectPath Decoder” on page 133 and “VectorPath Decoder” on page 133 for more information). The AMD Athlon™ processor enhanced decode logic can process three instructions per clock.
Table 19. Integer Instructions (Continued)
Columns: Instruction Mnemonic, First Byte, Second Byte, ModR/M Byte, Decode Type
JP/JPE near disp16/32 DirectPath
JNP/JPO near disp16/32 DirectPath
JL/JNGE near disp16/32 DirectPath
JNL/JGE near disp16/32 DirectPath
JLE/JNG near disp16/32...
Table 19. Integer Instructions (Continued)
Columns: Instruction Mnemonic, First Byte, Second Byte, ModR/M Byte, Decode Type
NOT mem8 mm-010-xx DirectPath
NOT mreg16/32 11-010-xxx DirectPath
NOT mem16/32 mm-010-xx DirectPath
OR mreg8, reg8 11-xxx-xxx DirectPath...
Table 19. Integer Instructions (Continued)
Columns: Instruction Mnemonic, First Byte, Second Byte, ModR/M Byte, Decode Type
POP EBX VectorPath
POP ESP VectorPath
POP EBP VectorPath
POP ESI VectorPath
POP EDI VectorPath
POP mreg 16/32...
DirectPath Instructions The following tables contain DirectPath instructions, which should be used in the AMD Athlon processor wherever possible: Table 25, “DirectPath Integer Instructions,” on page 220 Table 26, “DirectPath MMX™ Instructions,” on page 227 and Table 27, “DirectPath MMX™ Extensions,” on page 228 Table 28, “DirectPath Floating-Point Instructions,”...
VectorPath Instructions The following tables contain VectorPath instructions, which should be avoided in the AMD Athlon processor: Table 29, “VectorPath Integer Instructions,” on page 231 Table 30, “VectorPath MMX™ Instructions,” on page 234 and Table 31, “VectorPath MMX™ Extensions,” on page 234 Table 32, “VectorPath Floating-Point Instructions,”...