Page 1 AMD Athlon Processor x86 Code Optimization Guide...
Page 2 Trademarks AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
Page 3: Table Of Contents
About this Document ........1 AMD Athlon™ Processor Family......3 AMD Athlon Processor Microarchitecture Summary .
Page 4 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Switch Statement Usage........21 Optimize Switch Statements .
Page 5 Recommendations for AMD-K6 Family and AMD Athlon Processor Blended Code ....41 Cache and Memory Optimizations Memory Size and Alignment Issues ......45 Avoid Memory Size Mismatches .
Page 6 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Scheduling Optimizations Schedule Instructions According to their Latency ....67 Unrolling Loops......... . . 67 Complete Loop Unrolling .
Page 7 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Signed Derivation for Algorithm, Multiplier, and Shift Factor ......... 95 Floating-Point Optimizations Ensure All FPU Data is Aligned .
Page 8 Introduction ..........129 AMD Athlon Processor Microarchitecture ....130 Superscalar Processor .
Page 9 Write Combining ........139 AMD Athlon System Bus ......139...
Page 10 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h) ... . . 167 Starting and Stopping the Performance-Monitoring Counters ......... . 168 Event and Time-Stamp Monitoring Software.
Page 11 List of Figures Figure 1. AMD Athlon™ Processor Block Diagram ... 131 Figure 2. Integer Execution Pipeline ..... . . 135 Figure 3.
Page 13 Write Combining Completion Events ....158 Table 10. AMD Athlon™ System Bus Commands Generation Rules ....... 159 Table 11.
Page 14 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 29. VectorPath Integer Instructions ....231 Table 30. VectorPath MMX Instructions ....234 Table 31.
Page 15: Revision History
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Revision History Date Description Added “About this Document” on page 1. Further clarification of “Consider the Sign of Integer Operands” on page 14. Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
Page 17: Introduction
The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new level. The AMD Athlon processor has been designed to efficiently execute code written for previous-generation x86 processors.
Page 18 Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors. Appendix A: AMD Athlon Processor Microarchitecture. Describes in detail the microarchitecture of the AMD Athlon processor.
Page 19: Amd Athlon™ Processor Family
Appendix C: Implementation of Write Combining. Describes the algorithm used by the AMD Athlon processor to write combine. Appendix D: Performance Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.
Page 20: Amd Athlon Processor Microarchitecture Summary
To reduce on-chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls, the AMD Athlon processor has a dedicated high-speed backside L2 cache. The large 128-Kbyte L1 on-chip cache and the backside L2 cache allow the...
Page 21 As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the heart of the AMD Athlon processor. With the inclusion of all these features, the AMD Athlon processor is capable of decoding, issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scaleable performance.
Page 22 The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6®, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance.
Page 23: Top Optimizations
Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:...
Page 24: Optimization Star
3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All the prefetch...
Page 25: Select Directpath Over Vectorpath Instructions
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance: Prefetch Length = 200 ( Round up to the nearest cache line. DS is the data stride per loop iteration.
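The prefetch-distance calculation can be sketched in C. The formula is truncated in this excerpt, so the sketch assumes it has the form 200 × DS / C bytes, where C (cycles per loop iteration) is an assumed term that does not appear above. The function name and the line-size parameter are illustrative only.

```c
#include <stddef.h>

/* Hypothetical helper. Assumes Prefetch Length = 200 * DS / C bytes,
   rounded up to a whole cache line. DS is the data stride per loop
   iteration in bytes; C (cycles per iteration) is an assumed term that
   is not visible in the truncated formula. */
size_t prefetch_distance(size_t ds_bytes, size_t cycles_per_iter,
                         size_t line_bytes)
{
    /* Ceiling division, then round up to a multiple of the line size. */
    size_t raw = (200 * ds_bytes + cycles_per_iter - 1) / cycles_per_iter;
    return ((raw + line_bytes - 1) / line_bytes) * line_bytes;
}
```

Under these assumptions, an 8-byte stride in a 10-cycle loop with 64-byte lines yields a distance of 192 bytes, that is, three cache lines ahead.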
Page 26: Take Advantage Of Write Combining
BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache line aligned write buffer.
Page 27: Avoid Placing Code And Data In The Same 64-Byte Cache Line
Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not be shared in the same 64-byte cache line, especially if the data ever becomes modified.
Page 29: C Source Level Optimizations
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 C Source Level Optimizations This chapter details C programming practices for optimizing code for the AMD Athlon™ processor. Guidelines are listed in order of importance. Ensure Floating-Point Variables and Expressions are of Type Float For compilers that generate 3DNow!™...
Page 30: Consider The Sign Of Integer Operands
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Consider the Sign of Integer Operands In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required so an unsigned type is appropriate.
Page 31: Use Array Style Instead Of Pointer Style Code
Example (Avoid):
int i;        ====>  MOV EAX, i
i = i / 4;           CDQ
                     AND EDX, 3
                     ADD EAX, EDX
                     SAR EAX, 2
                     MOV i, EAX
Example (Preferred):
unsigned int i; ====> SHR i, 2
i = i / 4;...
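The signed/unsigned contrast can also be checked from the C side. In this minimal sketch the function names are illustrative. It shows only the semantics: unsigned division by 4 matches a plain shift, while signed division rounds toward zero and therefore needs a correction step for negative dividends.

```c
/* Unsigned division by a power of two is a plain logical shift. */
unsigned int div4_unsigned(unsigned int i)
{
    return i / 4;   /* same result as i >> 2 for every unsigned value */
}

/* Signed division rounds toward zero, so a negative dividend needs a
   correction before the arithmetic shift; the compiler has to emit
   extra instructions to provide it. */
int div4_signed(int i)
{
    return i / 4;
}
```

If a quantity can never be negative, declaring it unsigned lets the compiler pick the cheaper sequence without changing results.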
Page 32 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Note that source code transformations will interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source code transformations for improving performance and compiler optimizations "fight"...
Page 33
*res++ = dp; /* write transformed z */
dp = vv->x * *m++;
dp += vv->y * *m++;
dp += vv->z * *m++;
dp += vv->w * *m++;
*res++ = dp; /* write transformed w */
++vv;...
Page 34: Completely Unroll Small Loops
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Completely Unroll Small Loops Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops.
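As a small illustration of complete unrolling, the sketch below writes a three-element dot product both as a loop and fully unrolled. The function names and the choice of a dot product are assumptions made for the example. Whether the unrolled form is faster depends on the compiler and the surrounding code.

```c
/* Rolled version: the loop counter, compare, and branch are a large
   share of the work for a three-iteration body. */
float dot3_loop(const float a[3], const float b[3])
{
    float sum = 0.0f;
    for (int i = 0; i < 3; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Completely unrolled: no loop control remains, and the three
   multiplies are visible to the scheduler at once. */
float dot3_unrolled(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```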
Page 35 code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. It is therefore recommended that the programmer remove the dependency manually, e.g., by...
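The excerpt is cut off before it names the manual fix. One common approach, assumed here, is to carry the value in a local temporary so that no iteration has to reload what the previous one just stored. The running-sum loop and the function names below are illustrative.

```c
/* Each iteration reloads x[k-1], which the previous iteration stored:
   a store-to-load dependency through memory. */
void prefix_sum_reload(double *x, const double *y, int n)
{
    for (int k = 1; k < n; k++)
        x[k] = x[k - 1] + y[k];
}

/* The running value lives in a local, so the loop only stores to x[]
   and never depends on reading those stores back. */
void prefix_sum_temp(double *x, const double *y, int n)
{
    double t = x[0];
    for (int k = 1; k < n; k++) {
        t = t + y[k];
        x[k] = t;
    }
}
```

Both functions perform the same additions in the same order, so their results are identical.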
Page 36: Consider Expression Order In Compound Branch Conditions
Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && and ||. C guarantees a short-circuit evaluation of these operators.
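Short-circuit evaluation can be observed directly. In this sketch the predicate, the threshold, and the call counter are all hypothetical. It only demonstrates that the right operand of && is skipped when the left operand is false, which is why operand order matters.

```c
/* Counts how often the costly predicate actually runs. */
int expensive_calls = 0;

/* Hypothetical costly predicate, standing in for any expensive test. */
int expensive_check(int v)
{
    expensive_calls++;
    return v % 7 == 0;
}

/* The cheap test comes first; when it is false, && short-circuits and
   expensive_check() is never called. */
int is_candidate(int v)
{
    return v > 1000 && expensive_check(v);
}
```

Calling `is_candidate(5)` leaves `expensive_calls` at zero, because the first operand already decides the result.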
Page 37: Switch Statement Usage
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Switch Statement Usage Optimize Switch Statements Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrences, with the most probable first.
Page 38: Use Const Type Qualifier
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use Const Type Qualifier Use the “const” type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available to the compiler.
Page 39: Generalization For Multiple Constant Control Code
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Generalization for Multiple Constant Control Code To generalize this further for multiple constant control code some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this to a simple switch statement.
Page 40: Declare Local Functions As Static
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 case combine( 1, 1 ): for( i ... ) { DoWork1( i ); DoWork3( i ); break; default: break; The trick here is that there is some up-front work involved in generating all the combinations for the switch constant and the total amount of code has doubled.
Page 41: Dynamic Memory Allocation Consideration
which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
Page 42: Explicitly Extract Common Subexpressions
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits. Example 1 (Avoid): double a[100],sum; int i; sum = 0.0f;...
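The truncated example appears to concern summing an array of doubles. As a hedged sketch, assuming the intended rewrite splits the sum across independent partial sums, the two forms might look as follows. The function names and the four-way split are illustrative. As the text notes, reordering floating-point additions can change the least significant bits of the result.

```c
/* Single accumulator: every add must wait for the previous one. */
double sum_serial(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent partial sums keep a pipelined adder busy.
   n is assumed to be a multiple of 4. */
double sum_four_way(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```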
Page 43: C Language Structure Component Considerations
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 Avoid: double a,b,c,d,e,f; e = b*c/d; f = b/d*a; Preferred: double a,b,c,d,e,f,t; t = b/d; e = c*t; f = a*t; Example 2 Avoid: double a,b,c,e,f; e = a/c; f = b/c;...
Page 44: Sort Local Variables According To Base Type Size
Pad by Multiple of Largest Base Type: Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is...
Page 45: Accelerating Floating-Point Divides And Square Roots
The x87 FPU has a precision-control field as part of the FPU control word. The precision-control setting determines what precision results get rounded to. It affects the basic arithmetic operations, including divides and square roots. AMD Athlon and AMD-K6®...
Page 46 necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations. The Microsoft® Visual C environment provides functions to manipulate the FPU control word and thus the precision control.
Page 47: Avoid Unnecessary Integer Division
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Avoid Unnecessary Integer Division Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for reducing the number of integer divisions is multiple divisions, in which division can be replaced with multiplication as shown in the following examples.
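A minimal sketch of the multiple-division case: two chained divisions are replaced by one multiplication and one division. The function names are illustrative. The rewrite is valid only while the product of the divisors does not overflow, which is an assumption the caller must guarantee.

```c
/* Two dependent integer divisions. */
unsigned int div_twice(unsigned int i, unsigned int j, unsigned int k)
{
    return i / j / k;
}

/* One multiply plus one division. Equivalent for unsigned operands as
   long as j * k does not overflow. */
unsigned int div_once(unsigned int i, unsigned int j, unsigned int k)
{
    return i / (j * k);
}
```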
Page 48 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): //assumes pointers are different and q!=r void isqrt ( unsigned long a, unsigned long *q, unsigned long *r) *q = a; if (a > 0) while (*q > (*r = a / *q)) *q = (*q + *r) >>...
Page 49: Instruction Decoding Optimizations
Instruction Decoding Optimizations This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor. Guidelines are listed in order of importance. Overview The AMD Athlon processor instruction fetcher reads 16-byte aligned code windows from the instruction cache. The instruction bytes are then merged into a 24-byte instruction queue.
Page 50: Select Directpath Over Vectorpath Instructions
DirectPath instructions in the AMD Athlon processor. Assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions.
Page 51: Use Load-Execute Floating-Point Instructions With Floating-Point Operands
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use Load-Execute Floating-Point Instructions with Floating-Point Operands When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increase code density. Note: This optimization applies only to floating-point instructions with floating-point operands and not with integer operands, as described in the next optimization.
Page 52: Align Branch Targets In Program Hot Spots
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): QWORD PTR [foo] FIMUL DWORD PTR [bar] FIADD DWORD PTR [baz] Example 2 (Preferred): FILD DWORD PTR [bar] FILD DWORD PTR [baz] QWORD PTR [foo] FMULP ST(2), ST FADDP...
Page 53: Avoid Partial Register Reads And Writes
;uses 1-byte opcode, ; 8-bit immediate Avoid Partial Register Reads and Writes In order to handle partial register writes, the AMD Athlon processor execution core implements a data-merging scheme. In the execution unit, an instruction writing a partial register merges the modified portion with the current state of the remainder of the register.
Page 54: Replace Certain Shld Instructions With Alternative Code
LEA REG1, [REG1*8 + REG2] Use 8-Bit Sign-Extended Immediates Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD Athlon processor. For example, ADD BX, –5 should be encoded “83 C3 FB” and not “81 C3 FB FF”.
Page 55: Use 8-Bit Sign-Extended Displacements
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use 8-Bit Sign-Extended Displacements Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD Athlon processor. Code Padding Using Neutral Code Fillers Occasionally a need arises to insert neutral code fillers into the code stream, e.g., for code alignment purposes or to space out...
Page 56: Recommendations For The Amd Athlon Processor
For code that is optimized specifically for the AMD Athlon processor, the optimal code fillers are NOP instructions (opcode 0x90) with up to two REP prefixes (0xF3). In the AMD Athlon processor, a NOP with up to two REP prefixes can be handled by a single decoder with no overhead.
Page 57: Recommendations For Amd-K6 ® Family And Amd Athlon Processor Blended Code
The following assembly language macros show the recommended neutral code fillers for code optimized for the AMD Athlon processor that also has to run well on other x86 processors. Note that for some padding lengths, versions using ESP or EBP are missing due to the lack of fully generalized addressing modes.
Page 58 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx] NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx] NOP3_ESI TEXTEQU <DB 08Dh,024h,024h> ;lea esi, [esi] NOP3_EDI TEXTEQU <DB 08Dh,034h,026h> ;lea edi, [edi] NOP3_ESP TEXTEQU <DB 08Dh,03Ch,027h> ;lea esp, [esp] NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h>...
Page 59 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 ;lea edi ,[edi+00000000] NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0> ;lea ebp ,[ebp+00000000] NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0> ;lea eax,[eax*1+00000000] NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0> ;lea ebx,[ebx*1+00000000] NOP7_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0> ;lea ecx,[ecx*1+00000000] NOP7_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0>...
Page 61: Cache And Memory Optimizations
Cache and Memory Optimizations This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor. Guidelines are listed in order of importance. Memory Size and Alignment Issues Avoid Memory Size Mismatches Avoid memory size mismatches when instructions operate on the same data.
Page 62: Align Data Where Possible
3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor. The PREFETCH and PREFETCHW instructions take advantage of the AMD Athlon processor’s high bus bandwidth...
Page 63 PREFETCHW works the same as a PREFETCH on the AMD-K6-2 and AMD-K6-III processors; PREFETCHW gives a hint to the AMD Athlon processor of an intent to modify the cache line. The AMD Athlon processor will mark the cache line being brought in by PREFETCHW as Modified. Using...
Page 64 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 ECX, (-LARGE_NUM) ;used biased index EAX, OFFSET array_a ;get address of array_a EDX, OFFSET array_b ;get address of array_b ECX, OFFSET array_c ;get address of array_c $loop: PREFETCHW [EAX+196] ;two cachelines ahead...
Page 65 Determining Prefetch Distance: Given the latency of a typical AMD Athlon processor system and expected processor speeds, the following formula should be...
Page 66: Take Advantage Of Write Combining
write-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggressive write-combining algorithm, which improves performance significantly.
Page 67: Store-To-Load Forwarding Restrictions
Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store architecture when either a load operation is not allowed to read needed data from a store in the store buffer, or a load OP detects a false data dependency on a store in the store buffer.
Page 68 Narrow-to-Wide: If the following conditions are present, there is a...
Page 69 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 5 (Preferred): MOVD [foo], MM1 ;store lower half PUNPCKHDQ MM1, MM1 ;get upper half into lower half MOVD [foo+4], MM1 ;store lower half EAX, [foo] ;fine EDX, [foo+4] ;fine Misaligned If the following condition is present, there is a misaligned...
Page 70: Summary Of Store-To-Load Forwarding Pitfalls To Avoid
One Supported Store-to-Load Forwarding Case: There is one case of a mismatched store-to-load forwarding that is supported by the AMD Athlon processor. The lower 32 bits from an aligned QWORD write feeding into a DWORD read is allowed.
Page 71: Align Tbyte Variables On Quadword Aligned Addresses
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example (Preferred): Prolog: PUSH EBP, ESP ESP, SIZE_OF_LOCALS ;size of local variables ESP, –8 ;push registers that need to be preserved Epilog: ;pop register that needed to be preserved ESP, EBP With this technique, function arguments can be accessed via EBP, and local variables can be accessed via ESP.
Page 72: Sort Variables According To Base Type Size
Example: struct { char a[5]; long k; double x; } baz; The structure components should be allocated (lowest to highest address) as follows: x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0 See “C Language Structure Component Considerations” on page 27 for more information from a C source code perspective.
Page 73: Branch Optimizations
2 illustrate this concept using the CMOV instruction. Note that the AMD-K6® processor does not support the CMOV instruction. Therefore, blended AMD-K6 and AMD Athlon processor code should use examples 3 and 4. Avoid Branches Dependent on Random Data...
Page 74: Amd Athlon Processor Specific Code
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 AMD Athlon™ Processor Specific Code Example 1 — Signed integer ABS function (X = labs(X)): ECX, [X] ;load value EBX, ECX ;save value ;–value CMOVS ECX, EBX ;if –value is negative, select value [X], ECX ;save labs result...
Page 75: Always Pair Call And Return
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 6 — Increment Ring Buffer Offset: //C Code char buf[BUFSIZE]; int a; if (a < (BUFSIZE-1)) { a++; } else { a = 0; ;------------- ;Assembly Code EAX, [a] ; old offset EAX, (BUFSIZE-1) ;...
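The ring-buffer increment can also be written without a branch in C. This sketch is one possible formulation, not the guide's assembly sequence. The BUFSIZE value and the function names are illustrative, and the masking step assumes two's-complement integers.

```c
#define BUFSIZE 16   /* illustrative size; any positive value works */

/* Branching form, as in the C code above. */
int next_offset_branch(int a)
{
    if (a < (BUFSIZE - 1)) {
        a++;
    } else {
        a = 0;
    }
    return a;
}

/* Branchless form: the comparison yields 1 or 0, so its negation is
   all ones (-1) when the offset may advance and zero when it must
   wrap. ANDing with the mask selects a + 1 or 0. */
int next_offset_branchless(int a)
{
    int mask = -(a < (BUFSIZE - 1));
    return (a + 1) & mask;
}
```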
Page 76: Replace Branches With Computation In 3Dnow! Code
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Replace Branches with Computation in 3DNow!™ Code Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar and inhibit the SIMD processing that makes 3DNow! code superior.
Page 77: Sample Code Translated Into 3Dnow! Code
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 2 (Preferred): ; r = (x < y) ? a : b ; in: ; out: mm1 PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0 PAND MM1, MM3 ;...
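The mask-and-merge pattern behind this example can be modeled in scalar C for a single 32-bit lane. This is a sketch of the idea only. The function name is illustrative, and the real 3DNow!/MMX code processes two lanes per register.

```c
#include <stdint.h>

/* r = (x < y) ? a : b, computed with a mask instead of a branch.
   The comparison yields 0 or 1; subtracting it from zero in unsigned
   arithmetic gives 0 or 0xFFFFFFFF, the same all-zeros or all-ones
   pattern a packed compare produces per lane. */
uint32_t select_lt(int32_t x, int32_t y, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(y > x);
    return (a & mask) | (b & ~mask);   /* AND, AND-NOT, OR */
}
```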
Page 78 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 2: C code: float x,z; z = abs(x); if (z >= 1) { z = 1/z; 3DNow! code: ;in: MM0 = x ;out: MM0 = z MOVQ MM5, mabs ;0x7fffffff PAND MM0, MM5 ;z=abs(x)
Page 79 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 4: C code: #define PI 3.14159265358979323 float x,z,r,res; /* 0 <= r <= PI/4 */ z = abs(x) if (z < 1) { res = r; else { res = PI/2-r;...
Page 80 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 5: C code: #define PI 3.14159265358979323 float x,y,xa,ya,r,res; xs,df; xs = x < 0 ? 1 : 0; xa = fabs(x); ya = fabs(y); df = (xa < ya); if (xs && df) { res = PI/2 + r;...
Page 81: Avoid The Loop Instruction
The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:
Example 1 (Avoid):
LOOP LABEL
Example 2 (Preferred):
DEC ECX
JNZ LABEL
Avoid Far Control Transfer Instructions Avoid using far control transfer instructions.
Page 82: Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. A function is end-recursive when its call to itself is the last operation in its code.
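As an illustration of the conversion, the sketch below shows a factorial whose self-call is the final operation, next to an equivalent loop. The choice of factorial, the accumulator parameter, and the function names are assumptions made for the example.

```c
/* End-recursive form: the self-call is the last thing the function
   does, with the running product passed along in acc. */
unsigned long fac_tail(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return fac_tail(n - 1, acc * n);
}

/* Iterative form: the parameters become loop variables, so no call or
   return is issued per step. */
unsigned long fac_iter(unsigned long n)
{
    unsigned long acc = 1;
    while (n != 0) {
        acc *= n;
        n--;
    }
    return acc;
}
```

The recursive form is started with an accumulator of 1, for example `fac_tail(10, 1)`.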
Page 83: Scheduling Optimizations
Guidelines are listed in order of importance. Schedule Instructions According to their Latency The AMD Athlon™ processor can execute up to three x86 instructions per cycle, with each x86 instruction possibly having a different latency. The AMD Athlon processor has flexible scheduling, but for absolute maximum performance, schedule instructions, especially FPU and 3DNow!™...
Page 84: Partial Loop Unrolling
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.
Page 85 EAX, 8 EBX, 8 $add_loop The loop consists of seven instructions. The AMD Athlon processor can decode/retire three instructions per cycle, so it cannot execute faster than three iterations in seven cycles, or 3/7 floating-point adds per cycle. However, the pipelined floating-point adder allows one add every cycle.
Page 86 no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original...
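Partial unrolling by a factor of two can be expressed at the C level as follows. This is a sketch assuming the loop adds one array of doubles into another, which the excerpt implies but does not show in full. The function names and the odd-length handling are illustrative.

```c
/* One add per iteration: loop control is a large share of the work. */
void add_arrays(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by two: one set of loop-control instructions now serves two
   adds. When n is odd, the first element is handled separately.
   n is assumed to be non-negative. */
void add_arrays_unroll2(double *a, const double *b, int n)
{
    int i = 0;
    if (n & 1) {
        a[0] += b[0];
        i = 1;
    }
    for (; i < n; i += 2) {
        a[i] += b[i];
        a[i + 1] += b[i + 1];
    }
}
```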
Page 87: Use Function Inlining
AMD Athlon processor is less susceptible than other processors to the negative side effect of function inlining. Function call overhead on the AMD Athlon processor can be low because calls and returns are executed at high speed due to the use of prediction mechanisms.
Page 88: Always Inline Functions If Called From One Site
Avoid Address Generation Interlocks Loads and stores are scheduled by the AMD Athlon processor to access the data cache in program order. Newer loads and stores with their addresses calculated can be blocked by older loads and stores whose addresses are not yet calculated –...
Page 89: Use Movzx And Movsx
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): ADD EBX, ECX ;inst 1 MOV EAX, DWORD PTR [10h] ;inst 2 (fast address calc.) MOV ECX, DWORD PTR [EAX+EBX] ;inst 3 (slow address calc.) MOV EDX, DWORD PTR [24h] ;this load is stalled from...
Page 90 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i; for (i=0; i < MAXSIZE; i++) { c [i] = a[i] + b[i]; ECX, MAXSIZE ;initialize loop counter ESI, ESI ;initialize offset into array a EDI, EDI ;initialize offset into array b...
Page 91: Push Memory Data Carefully
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.g., when the base addresses are passed as...
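The biased-index idea, a single index that starts negative and counts up to zero, can be sketched in C. The function name and the element-count parameter are illustrative assumptions. A compiler may or may not turn this into the addressing pattern the guide describes.

```c
/* Adds a[] and b[] into c[] using one index that runs from -n up to 0.
   Each pointer is advanced to one past the end of its array, so the
   negative index still addresses elements 0 .. n-1, and the loop ends
   on a simple comparison with zero. */
void add_biased(int *c, const int *a, const int *b, int n)
{
    const int *a_end = a + n;
    const int *b_end = b + n;
    int *c_end = c + n;

    for (int i = -n; i < 0; i++)
        c_end[i] = a_end[i] + b_end[i];
}
```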
Page 93: Integer Optimizations
Replace Divides with Multiplies Replace integer division by constants with multiplication by the reciprocal. Because the AMD Athlon™ processor has a very fast integer multiply (5–9 cycles signed, 4–8 cycles unsigned) and the integer division delivers only one bit of quotient per cycle (22–47 cycles signed, 17–41 cycles unsigned), the...
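A minimal sketch of division by a constant through reciprocal multiplication, here for the divisor 10. The constant 0xCCCCCCCD and shift of 35 are the well-known values for unsigned 32-bit division by 10. They are given as an example and are not output from the guide's utilities. The function name is illustrative.

```c
#include <stdint.h>

/* Unsigned division by 10 via a scaled reciprocal. 0xCCCCCCCD is
   ceil(2^35 / 10), so the upper bits of the 64-bit product, shifted
   right by 35 in total, equal x / 10 for every 32-bit x. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Most optimizing compilers apply this transformation automatically when the divisor is a compile-time constant.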
Page 94: Unsigned Division By Multiplication Of Constant
Signed Division Utility: In the opt_utilities directory of the AMD documentation CDROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user enters a signed constant divisor.
Page 95: Signed Division By Multiplication Of Constant
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1: ;In: EDX = dividend ;Out: EDX = quotient XOR EDX, EDX;0 CMP EAX, d ;CF = (dividend < divisor) ? 1 : 0 SBB EDX, -1 ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1...
Page 96 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 ;algorithm 1 EAX, m EDX, dividend ECX, EDX IMUL EDX, ECX ECX, 31 EDX, s EDX, ECX ;quotient in EDX Derivation for a, m, s The derivation for the algorithm (a), multiplier (m), and shift count (s), is found in the section “Signed Derivation for...
Page 97: Use Alternative Code When Multiplying By A Constant
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Remainder of Signed ;IN:EAX = dividend ;OUT:EAX = remainder Integer 2 or –(2 ;Sign extend into EDX EDX, (2^n–1) ;Mask correction (abs(divison)–1) EAX, EDX ;Apply pre-correction EAX, (2^n–1) ;Mask out remainder (abs(divison)–1) EAX, EDX ;Apply pre-correction, if necessary...
Page 98 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 by 11: REG2, [REG1*8+REG1] ;3 cycles REG1, REG1 REG1, REG2 by 12: REG1, 2 REG1, [REG1*2+REG1] ;3 cycles by 13: REG2, [REG1*2+REG1] ;3 cycles REG1, 4 REG1, REG2 by 14: REG2, [REG1*4+REG1] ;3 cycles...
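The shift-and-add decompositions listed here can be mirrored in C. The mnemonics are missing from the listing above, so the sketch follows one reading of the operand patterns (scaled-index adds, shifts, and a subtract). That reading and the function names are assumptions.

```c
#include <stdint.h>

/* x * 11 = (x*8 + x) + (x + x) */
uint32_t mul11(uint32_t x)
{
    uint32_t t = (x << 3) + x;   /* 9x */
    x = x + x;                   /* 2x */
    return x + t;                /* 11x */
}

/* x * 12 = (x*4)*2 + (x*4) */
uint32_t mul12(uint32_t x)
{
    x <<= 2;                     /* 4x */
    return (x << 1) + x;         /* 12x */
}

/* x * 13 = x*16 - (x*2 + x) */
uint32_t mul13(uint32_t x)
{
    uint32_t t = (x << 1) + x;   /* 3x */
    x <<= 4;                     /* 16x */
    return x - t;                /* 13x */
}
```

All arithmetic is unsigned, so the results wrap modulo 2^32 exactly as a plain multiplication would.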
Page 99: Use Mmx™ Instructions For Integer-Only Work
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 by 26: use IMUL by 27: REG2, [REG1*4+REG1] ;3 cycles REG1, 5 REG1, REG2 by 28: REG2, REG1 ;3 cycles REG1, 3 REG1, REG2 REG1, 2 by 29: REG2, [REG1*2+REG1] ;3 cycles...
Page 100: Repeated String Instruction Usage
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle. Repeated String Instruction Usage Latency of Repeated String Instructions Table 1 shows the latency for repeated string instructions on the AMD Athlon processor.
Page 101 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Ensure DF=0 (UP) Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, source and destination overlap).
Page 102: Use Xor Instruction To Clear Integer Registers
To clear an integer register to all 0s, use “XOR reg, reg”. The AMD Athlon processor is able to avoid the false read dependency on the XOR instruction. Example 1 (Acceptable):...
Page 103 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 4 (Left shift): ;shift operand in EDX:EAX left, shift count in ECX (count applied modulo 64) SHLD EDX, EAX, CL ;first apply shift count EAX, CL ; mod 32 to EDX:EAX...
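The 64-bit left shift on a pair of 32-bit halves can be modeled in C to check the logic. The function name and the pointer interface are illustrative, and this is a sketch of the behavior rather than a translation of the assembly. The count is applied modulo 64, as the comment in the listing states.

```c
#include <stdint.h>

/* Shifts the 64-bit value hi:lo left by count (taken modulo 64),
   using only 32-bit operations. */
void shl64(uint32_t *hi, uint32_t *lo, unsigned int count)
{
    count &= 63;
    if (count >= 32) {
        /* Everything in lo moves into hi; lo becomes zero. */
        *hi = *lo << (count - 32);
        *lo = 0;
    } else if (count != 0) {
        /* Bits leaving the top of lo enter the bottom of hi. */
        *hi = (*hi << count) | (*lo >> (32 - count));
        *lo <<= count;
    }
}
```

The zero-count case is handled separately because shifting a 32-bit value by 32 is undefined in C.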
Page 104 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 7 (Division): ;_ulldiv divides two unsigned 64-bit integers, and returns the quotient. ;INPUT: [ESP+8]:[ESP+4] dividend [ESP+16]:[ESP+12] divisor ;OUTPUT: EDX:EAX quotient of division ;DESTROYS: EAX,ECX,EDX,EFlags _ulldiv PROC PUSH ;save EBX as per calling convention ECX, [ESP+20] ;divisor_hi...
Page 105 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 ECX, EAX ;save quotient IMUL EDI, EAX ;quotient * divisor hi-word ; (low only) DWORD PTR [ESP+20];quotient * divisor lo-word EDX, EDI ;EDX:EAX = quotient * divisor EBX, EAX ;dividend_lo – (quot.*divisor)_lo EAX, ECX ;get quotient...
Page 106 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 $r_two_divs: ECX, EAX ;save dividend_lo in ECX EAX, EDX ;get dividend_hi EDX, EDX ;zero extend it into EDX:EAX ;EAX = quotient_hi, EDX = intermediate ; remainder EAX, ECX ;EAX = dividend_lo ;EAX = quotient_lo EAX, EDX ;EAX = remainder_lo...
Page 107: Efficient Implementation Of Population Count Function
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Efficient Implementation of Population Count Function Population count is an operation that determines the number of set bits in a bit string. For example, this can be used to determine the cardinality of a set. The following example code shows how to efficiently implement a population count operation for 32-bit operands.
Page 108 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Step 3 For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed: y = (x + (x >>...
Page 109: By Constants
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 EAX, EDX ;x = (w & 0x33333333) + ((w >> 2) & ; 0x33333333) EDX, EDX EAX, 4 ;x >> 4 EAX, EDX ;x + (x >> 4) EAX, 00F0F0F0Fh ;y = (x + (x >> 4) & 0x0F0F0F0F) IMUL EAX, 001010101h ;y * 0x01010101...
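The population count steps can be written compactly in C. The sketch below follows the expressions visible in the assembly comments above (`x`, `y`, and the multiply by 0x01010101); the first step, which forms `w` from 2-bit fields, and the final shift by 24 are the standard form of this algorithm and are assumed here because they are not visible in this excerpt. The name `popcount32` is hypothetical.

```c
#include <stdint.h>

/* Count the set bits in a 32-bit operand using parallel field sums. */
uint32_t popcount32(uint32_t v)
{
    /* Step 1 (assumed): each 2-bit field holds the count of its two bits. */
    uint32_t w = v - ((v >> 1) & 0x55555555u);
    /* Step 2: each 4-bit field holds the count of its four bits. */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    /* Step 3: each 8-bit field holds the count of its eight bits. */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;
    /* Step 4: the multiply sums the four byte counts into the top byte. */
    return (y * 0x01010101u) >> 24;
}
```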
Page 110 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 ;algorithm 1 EDX, dividend EAX, m EAX, m EDX, 0 EDX, s ;EDX=quotient typedef unsigned __int64 U64; typedef unsigned long U32; U32 d, l, s, m, a, r; U64 m_low, m_high, j, k;...
Page 111: Shift Factor
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 /* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: “Integer Multiplication and Division on the HP Precision Architecture”. IEEE Transactions on Computers, Vol 37, No. 8, August 1988, page 980. */ else { s = log2(d);...
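As a concrete instance of replacing an unsigned division with a multiply and a shift, the sketch below divides by the constant 10. It uses the simpler multiply-then-shift form (no add step after the multiply). The multiplier 0xCCCCCCCD and shift factor 3 are well-known values for a divisor of 10 and are supplied here as an assumption; they are not taken from the guide, and the name `udiv10` is hypothetical.

```c
#include <stdint.h>

/* Assumed constants for divisor d = 10 (not taken from the guide). */
#define DIV10_M 0xCCCCCCCDu /* multiplier m */
#define DIV10_S 3           /* shift factor s */

/* Compute dividend / 10 without a divide instruction. */
uint32_t udiv10(uint32_t dividend)
{
    /* 32x32 -> 64-bit multiply, like MUL leaving its result in EDX:EAX. */
    uint64_t product = (uint64_t)dividend * DIV10_M;
    /* Keep the upper 32 bits, the value MUL leaves in EDX. */
    uint32_t high = (uint32_t)(product >> 32);
    /* Final shift by s, like SHR EDX, s. */
    return high >> DIV10_S;
}
```

For a different divisor, both constants change, which is what the generator code in this section computes.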
Page 112 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 ;algorithm 1 EAX, m EDX, dividend ECX, EDX IMUL EDX, ECX ECX, 31 EDX, s EDX, ECX ; quotient in EDX typedef unsigned __int64 U64; typedef unsigned long U32; U32 log2 (U32 i) U32 t = 0;...
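The signed sequence above (multiply, arithmetic shift by s, then add the sign bit of the dividend) can be sketched in C for one fixed divisor. The constants below, m = 0x66666667 and s = 1 for a divisor of 5, are well-known values supplied as an assumption rather than taken from the guide, and `sdiv5` is a hypothetical name. The sketch also assumes that right-shifting a negative signed integer is an arithmetic shift, which is implementation-defined in C but holds on mainstream x86 compilers.

```c
#include <stdint.h>

/* Compute dividend / 5, truncating toward zero, without a divide. */
int32_t sdiv5(int32_t dividend)
{
    /* Signed 32x32 -> 64-bit multiply, like IMUL leaving EDX:EAX. */
    int64_t product = (int64_t)dividend * 0x66666667; /* assumed m */
    /* Upper 32 bits of the product (assumes arithmetic right shift). */
    int32_t high = (int32_t)(product >> 32);
    /* 1 if the dividend is negative, else 0, like SHR ECX, 31. */
    int32_t sign = (int32_t)((uint32_t)dividend >> 31);
    /* SAR EDX, s with assumed s = 1, then ADD EDX, ECX. */
    return (high >> 1) + sign;
}
```

Adding the sign bit corrects the rounding direction for negative dividends, so the result truncates toward zero as C integer division does.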
Page 113: Floating-Point Optimizations
Floating-Point Optimizations This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU).
Page 114: Use Ffreep Macro To Pop One Register From The Fpu Stack
;removes one register from stack FCOMPP ;removes two registers from stack On the AMD Athlon processor, a faster alternative is to use the FFREEP instruction below. Note that the FFREEP instruction, although insufficiently documented in the past, is supported by all 32-bit x86 processors.
Page 115: Use The Fxch Instruction Rather Than Fst/Fld Pairs
Although the AMD Athlon processor FPU has a deep scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency chains can stall the scheduler while issue slots are still available.
Page 116: Minimize Floating-Point-To-Integer Conversions
FLDCW [SAVE_CW] ;restore original control word The AMD Athlon processor contains special acceleration hardware to execute such code as quickly as possible. In most situations, the above code is therefore the fastest way to perform floating-point-to-integer conversion and the conversion is compliant both with programming language standards and the IEEE-754 standard.
Page 117 FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it. For applications in which the...
Page 118 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 3 (Potentially faster): ECX, DWORD PTR[X+4] ;get upper 32 bits of double EDX, EDX ;i = 0 EAX, ECX ;save sign bit ECX, 07FF00000h ;isolate exponent field ECX, 03FF00000h ;if abs(x) < 1.0 $DONE2 ;...
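Example 3 works on the bit pattern of the double rather than on the FPU: it isolates the exponent field and returns 0 early when abs(x) < 1.0. The C sketch below illustrates that integer-only approach under stated assumptions. It assumes IEEE-754 binary64 doubles, it is not a transcription of the guide's assembly, and the name `trunc_to_int32` and the choice to return INT32_MIN for out-of-range inputs are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Truncate a double toward zero using only integer operations. */
int32_t trunc_to_int32(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits); /* read the raw IEEE-754 encoding */

    /* Unbiased exponent from the 11-bit exponent field. */
    int32_t exponent = (int32_t)((bits >> 52) & 0x7FFu) - 1023;
    if (exponent < 0) {
        return 0; /* abs(x) < 1.0 truncates to zero */
    }
    if (exponent > 30) {
        return INT32_MIN; /* too large, infinity, or NaN (assumed policy) */
    }

    /* Restore the implicit leading 1 and drop the fractional bits. */
    uint64_t mantissa = (bits & 0x000FFFFFFFFFFFFFull) | 0x0010000000000000ull;
    int32_t magnitude = (int32_t)(mantissa >> (52 - exponent));

    /* Apply the sign bit saved from the top of the encoding. */
    return (bits >> 63) ? -magnitude : magnitude;
}
```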
Page 119: Floating-Point Subexpression Elimination
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Floating-Point Subexpression Elimination There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries. In the cases where two instructions share a source operand, an FXCH is not required between the two instructions.
Page 120 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 If an “argument out of range” is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an argument > 2^63 is unusual, it often indicates a problem elsewhere in the code and the code may completely fail in the absence of a properly guarded trigonometric instruction.
Page 121: Take Advantage Of The Fsincos Instruction
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it. Take Advantage of the FSINCOS Instruction...
Page 123: 3Dnow!™ And Mmx™ Optimizations
3DNow!™ and MMX™ Optimizations This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be found in the 3DNow!™ Instruction Porting Guide, order# 22621. Use 3DNow!™ Instructions...
Page 124: Use 3Dnow! Instructions For Fast Division
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction. 3DNow! and MMX instructions are designed to be used concurrently with no switching issues. Likewise, enhanced 3DNow! instructions can be used simultaneously with MMX instructions.
Page 125: Pipelined Pair Of 24-Bit Precision Divides
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Pipelined Pair of 24-Bit Precision Divides This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within preceding code.
Page 126: Use 3Dnow! Instructions For Fast Square Root And Reciprocal Square Root
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root 3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root. Optimized 15-Bit Precision Square Root...
Page 127: Newton-Raphson Reciprocal Square Root
= PFMUL(b,X The 24-bit final reciprocal square root value is X . In the AMD Athlon processor 3DNow! implementation, the estimate contains the correct round-to-nearest value for approximately 87% of all arguments. The remaining arguments differ from the correct round-to-nearest value by one unit-in-the-last-place. The...
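The refinement behind a Newton-Raphson reciprocal square root can be shown in scalar C. This sketch is not the 3DNow! instruction sequence: it only applies the textbook iteration x' = 0.5 * x * (3 - b * x * x) to a caller-supplied starting estimate. The name `rsqrt_refine` and the iteration-count parameter are hypothetical.

```c
/* Refine an estimate x0 of 1/sqrt(b) with Newton-Raphson iterations. */
float rsqrt_refine(float b, float x0, int iterations)
{
    float x = x0;
    for (int i = 0; i < iterations; ++i) {
        /* Each step roughly doubles the number of correct bits. */
        x = 0.5f * x * (3.0f - b * x * x);
    }
    return x;
}
```

Starting from 0.4 for b = 4.0, a few iterations converge to about 0.5. A hardware estimate that is already accurate to many bits needs far fewer steps, which is why one refinement pass can be enough to reach full single precision.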
Page 128: 3Dnow! And Mmx Intra-Operand Swapping
= mmreg2[31:0]) mmreg1[31:0] = mmreg2[63:32]) See the AMD Extensions to the 3DNow! and MMX Instruction Set Manual, order #22466 for more usage information. Blended Code Otherwise, for blended code, which needs to run well on...
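The swap described by the two assignments above (each 32-bit half of the destination takes the opposite half of the source) can be modeled on a plain 64-bit integer. This is only a behavioral sketch in C; the function name `swap_halves_model` is hypothetical and the code does not use any MMX or 3DNow! instruction.

```c
#include <stdint.h>

/* Model of an intra-operand swap of the two 32-bit halves of a quadword. */
uint64_t swap_halves_model(uint64_t src)
{
    /* result[63:32] = src[31:0], result[31:0] = src[63:32] */
    return (src << 32) | (src >> 32);
}
```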
Page 129: Fast Conversion Of Signed Words To Floating-Point
AMD processors. The first example shows how to do the conversion on a processor that supports AMD’s 3DNow! extensions, such as the AMD Athlon processor. It demonstrates the increased efficiency from using the PI2FW instruction.
Page 130: Use Mmx Pcmp Instead Of 3Dnow! Pfcmp
PXOR and PMUL instructions are the same in terms of latency. On the AMD-K6 processor, there is only a one cycle latency for PXOR, versus a two cycle latency for the 3DNow! PFMUL instruction.
Page 131: Use Mmx Instructions For Block Copies And Block Fills
AMD Athlon processor specific code where the destination is in cacheable memory and immediate data re-use of the data at the destination is expected AMD-K6 family specific code where the destination is in non-cacheable memory Example 1: /* block copy (source and destination QWORD aligned) */...
Page 132 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 $xfer: movq mm0, [eax] edx, 64 movq mm1, [eax+8] eax, 64 movq mm2, [eax-48] movq [edx-64], mm0 movq mm0, [eax-40] movq [edx-56], mm1 movq mm1, [eax-32] movq [edx-48], mm2 movq mm2, [eax-24]...
Page 133 Microsoft Visual C, is suitable for moving/filling a quadword aligned block of data in the following situations: AMD Athlon processor specific code where the destination of the block copy is in non-cacheable memory space; AMD Athlon processor specific code where the destination of the block copy is in cacheable space, but no immediate data re-use of the data at the destination is expected.
Page 134: Use Mmx Pxor To Clear All Bits In An Mmx Register
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 /* block fill (destination QWORD aligned) */ __asm { edx, [dst_ptr] ecx, [blk_size] ecx, 6 movq mm0, [fill_data] align 16 $fill_nc: movntq [edx], mm0 movntq [edx+8], mm0 movntq [edx+16], mm0 movntq [edx+24], mm0...
Page 135: Use Mmx Pcmpeqd To Set All Bits In An Mmx Register
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register To set all the bits in an MMX register to one, use: PCMPEQD MMreg, MMreg Note that PCMPEQD MMreg, MMreg is dependent on previous writes to MMreg.
Page 136 EBX, [RES] ;EBX = destination vector ptr ECX, [NUMVERTS] ;ECX = number of vertices to transform ;3DNow! version of fully general 3D vertex transformation. ;Optimal for AMD Athlon (completes in 16 cycles) FEMMS ;clear MMX state ALIGN ;for optimal branch alignment...
Page 137 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 $$xform: EBX, 16 ;res++ MOVQ MM0, QWORD PTR [EDX] ;v->y | v->x MOVQ MM1, QWORD PTR [EDX+8] ;v->w | v->z EDX, 16 ;v++ MOVQ MM2, MM0 ;v->y | v->x MOVQ MM3, QWORD PTR [EAX+M00] ;m[0][1] | m[0][0] PUNPCKLDQ MM0, MM0 ;v->x | v->x...
Page 138: Efficient 3D-Clipping Code Computation Using 3Dnow! Instructions
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions Clipping is one of the major activities occurring in a 3D graphics pipeline. In many instances, this activity is split into two parts which do not necessarily have to occur consecutively:...
Page 139: Use 3Dnow! Pavgusb For Mpeg-2 Motion Compensation
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 DESTROYS MM0,MM1,MM2,MM3,MM4 PXOR MM0, MM0 ; 0 | 0 MOVQ MM1, MM6 ; w | z MOVQ MM4, MM5 ; y | x PUNPCKHDQ MM1, MM1 ; w | w MOVQ MM3, MM6 ;...
Page 140 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): ESI, DWORD PTR Src_MB EDI, DWORD PTR Dst_MB EDX, DWORD PTR SrcStride EBX, DWORD PTR DstStride MOVQ MM7, QWORD PTR [ConstFEFE] MOVQ MM6, QWORD PTR [Const0101] ECX, 16 MOVQ MM0, [ESI] ;MM0=QWORD1...
Page 141: Stream Of Packed Unsigned Bytes
The following code fragment uses the 3DNow! PAVGUSB instruction to perform averaging between the source macroblock and destination macroblock: Example 2 (Preferred): EAX, DWORD PTR Src_MB EDI, DWORD PTR Dst_MB...
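To make the averaging concrete, the C sketch below computes the rounded average (a + b + 1) >> 1 of packed unsigned bytes in two ways: a per-byte reference, and a masked form that processes four bytes in one 32-bit word, in the spirit of the 0xFE mask constant used by the slower example. Both function names are hypothetical, and the masked formula is a standard identity supplied here as an assumption rather than a transcription of the guide's code.

```c
#include <stdint.h>

/* Reference: rounded average of two unsigned bytes. */
uint8_t avg_byte_reference(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Rounded average of four packed bytes at once, without PAVGUSB.
 * Uses (a | b) - ((a ^ b) >> 1) per byte; the 0xFE mask keeps the
 * shifted bits from leaking into the neighboring byte. */
uint32_t avg_bytes_rounded(uint32_t a, uint32_t b)
{
    return (a | b) - (((a ^ b) & 0xFEFEFEFEu) >> 1);
}
```

The masked form needs several logical operations per word, whereas a single packed-average instruction produces the same result for eight bytes, which is the motivation for the preferred example.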
Page 142: Complex Number Arithmetic
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Complex Number Arithmetic Complex numbers have a “real” part and an “imaginary” part. Multiplying complex numbers (ex. 3 + 4i) is an integral part of many algorithms such as Discrete Fourier Transform (DFT) and complex FIR filters.
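For reference, the arithmetic being optimized is (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The scalar C sketch below shows that computation; the `complex_f32` type and `complex_mul` name are hypothetical, and the code makes no attempt to model the packed 3DNow! data layout.

```c
/* Hypothetical single-precision complex value. */
typedef struct {
    float re; /* real part */
    float im; /* imaginary part */
} complex_f32;

/* Multiply two complex numbers: four multiplies, one subtract, one add. */
complex_f32 complex_mul(complex_f32 a, complex_f32 b)
{
    complex_f32 result;
    result.re = a.re * b.re - a.im * b.im; /* ac - bd */
    result.im = a.re * b.im + a.im * b.re; /* ad + bc */
    return result;
}
```

For example, (3 + 4i)(1 + 2i) evaluates to -5 + 10i.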
Page 143: General X86 Optimization Guidelines
AMD-K6 processor, Pentium, and Pentium Pro processors either improve the performance of the AMD Athlon processor or are not required and have a neutral effect (usually due to fewer coding restrictions with the AMD Athlon processor).
Page 144: Dependencies
Dependencies Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
Page 145: Amd Athlon™ Processor Microarchitecture
Appendix A AMD Athlon™ Processor Microarchitecture Introduction When discussing processor design, it is important to understand the following terms: architecture, microarchitecture, and design implementation. The term architecture refers to the instruction set and features of a processor that are visible to software programs running on the processor.
Page 146: Amd Athlon Processor Microarchitecture
Instead of executing complex x86 instructions, which have lengths from 1 to 15 bytes, the AMD Athlon processor executes the simpler fixed-length OPs, while maintaining the instruction coding efficiencies found in x86 programs. The enhanced microarchitecture used in the...
Page 147: Instruction Cache
L2 SRAMs Figure 1. AMD Athlon™ Processor Block Diagram Instruction Cache The out-of-order execute engine of the AMD Athlon processor contains a very large 64-Kbyte L1 instruction cache. The L1 instruction cache is organized as a 64-Kbyte, two-way, set-associative array. Each line in the instruction array is 64 bytes long.
Page 148: Predecode
The AMD Athlon processor employs combinations of a branch target address buffer (BTB), a global history bimodal counter (GHBC) table, and a return address stack (RAS) hardware in order to predict and accelerate branches.
Page 149: Early Decoding
return stack. Subsequent RETs pop a predicted return address off the top of the stack. Early Decoding The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs.
Page 150: Instruction Control Unit
22007E/0—November 1999 Instruction Control Unit The instruction control unit (ICU) is the control center for the AMD Athlon processor. The ICU controls the following resources—the centralized in-flight reorder buffer, the integer scheduler, and the floating-point scheduler. In turn, the ICU is responsible for the following functions —...
Page 151: Integer Scheduler
Integer Scheduler The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes. The reservation stations are six entries deep, for a total queuing system of 18 integer MacroOPs. Each reservation station divides the MacroOPs into...
Page 152: Floating-Point Scheduler
Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calculate the logical addresses for loads, stores, and LEAs.
Page 153: Floating-Point Execution Unit
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Floating-Point Execution Unit The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructions, all 3DNow! operations, and all MMX operations.
Page 154: Load-Store Unit (Lsu)
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Load-Store Unit (LSU) The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory. The 44-entry LSU provides a data interface for both the integer scheduler and the floating-point scheduler.
Page 155: L2 Cache Controller
155 for detailed information about write combining. AMD Athlon™ System Bus The AMD Athlon system bus is a high-speed bus that consists of a pair of unidirectional 13-bit address and control channels and a bidirectional 64-bit data bus. The AMD Athlon system bus supports low-voltage swing, multiprocessing, clock forwarding, and fast data transfers.
Page 157: Appendix B Pipeline And Execution Unit Resources Overview
Fetch and Decode Pipeline Stages Figure 5 on page 142 and Figure 6 on page 142 show the AMD Athlon processor instruction fetch and decoding pipeline stages. The pipeline consists of one cycle for instruction fetches and four cycles of instruction alignment and decoding. The...
Page 158: Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
[Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware. Block diagram; labels recoverable from the extraction include Entry Point, VectorPath, Decode, and MROM.]
Page 159 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Cycle 1–FETCH The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory. Cycle 2–SCAN SCAN determines the start and end pointers of instructions.
Page 160: Integer Pipeline Stages
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU. Integer Pipeline Stages The integer execution pipeline consists of four or more stages for scheduling and execution and, if necessary, accessing data in the processor caches or system memory.
Page 161 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Cycle 7–SCHED In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus. When all operands are received, SCHED schedules the MacroOP for execution and issues the OPs to the next stage, EXEC.
Page 162: Floating-Point Pipeline Stages
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Floating-Point Pipeline Stages The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructions, all 3DNow! operations, and all MMX operations.
Page 163 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Cycle 7–STKREN The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags. Cycle 8–REGREN The register renaming (REGREN) pipeline stage in cycle 8 is responsible for register renaming.
Page 164: Execution Unit Resources
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Execution Unit Resources Terminology The execution units operate with two types of register values— operands and results. There are three operand types and two result types, which are described in this section.
Page 165: Integer Pipeline Operations
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Integer Pipeline Operations Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type. Table 2. Integer Pipeline Operation Types Category...
Page 166: Floating-Point Pipeline Operations
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Floating-Point Pipeline Operations Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types. Table 4. Floating-Point Pipeline Operation Types...
Page 167: Load/Store Pipeline Operations
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Load/Store Pipeline Operations The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations. For example, consider the following code sample: AX, [EBX] ;1 load MacroOP PUSH ;1 store MacroOP ;1 load MacroOP...
Page 168: Code Sample Analysis
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Code Sample Analysis The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints.
Page 169: Table 7. Sample 1 - Integer Register Operations
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 7. Sample 1 – Integer Register Operations Clocks Instruction Decode Decode Number Pipe Type Instruction IMUL EAX, ECX EDI, 0x07F4 EDI, EBX EAX, 8 EAX, 0x0F ESI, EDX Comments for Each Instruction Number 1.
Page 170: Table 8. Sample 2 - Integer Register And Memory Load
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 8. Sample 2 – Integer Register and Memory Load Operations Clocks Instruc Decode Decode Instruction Pipe Type 10 11 12 EDI, [ECX] &/S EAX, [EDX+20] &/S EAX, 5 ECX, [EDI+4] &/S...
Page 171: Appendix C Implementation Of Write Combining
Write Combining Introduction This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family. The AMD Athlon processor supports the memory type and range register (MTRR) and the page attribute table (PAT) extensions, which allow software to define ranges of memory as either writeback (WB), write-protected (WP), writethrough (WT), uncacheable (UC), or write-combining (WC).
Page 172: Write-Combining Definitions And Abbreviations
The steps required for programming write combining on the AMD Athlon processor are as follows: 1. Verify the presence of an AMD Athlon processor by using the CPUID instruction to check for the instruction family code and vendor identification of the processor. Standard...
Page 173: Write-Combining Operations
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six. 2. In addition, the presence of the MTRRs is indicated by bit...
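The family check in step 1 reduces to extracting bits 11–8 of the signature returned in EAX. The C sketch below shows only that bit manipulation: it takes the EAX value as an argument rather than executing CPUID, and the function names are hypothetical.

```c
#include <stdint.h>

/* Extract the instruction family code from a CPUID signature (EAX[11:8]). */
uint32_t cpuid_family(uint32_t eax_signature)
{
    return (eax_signature >> 8) & 0xFu;
}

/* Nonzero when the family code is six, as stated for the AMD Athlon. */
int is_family_six(uint32_t eax_signature)
{
    return cpuid_family(eax_signature) == 6u;
}
```

A complete check would also compare the vendor identification string, as the step describes; that part is omitted here.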
Page 174: Table 9. Write Combining Completion Events
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 9. Write Combining Completion Events Event Comment The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect Non-WB write outside of write combining.
Page 175: Sending Write-Buffer Data To The System
Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands. Table 10 lists the rules for determining what system commands are issued for a write buffer, as a function of the alignment of the valid buffer data.
Page 177: Appendix D Performance-Monitoring Counters
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Appendix D Performance-Monitoring Counters This chapter describes how to use the AMD Athlon™ processor performance monitoring counters. Overview The AMD Athlon processor provides four 48-bit performance counters, which allows four types of events to be monitored simultaneously.
Page 178: Perfevtsel[3:0] Msrs (Msr Addresses C001_0000H-C001_0003H)
These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h. The PerfCtr[3:0] registers are located at MSR locations C001_0004h to C001_0007h and are 64-bit registers.
Page 179 Unit Mask Field (Bits 8–15) These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states. See Table 11 on page 164 for a list of unit masks and their 8-bit codes.
Page 180: Table 11. Performance-Monitoring Counters
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 greater than or equal to the counter mask. Otherwise if this field is zero, then the counter increments by the total number of events. Table 11. Performance-Monitoring Counters Event Source Notes / Unit Mask (bits 15–8)
Page 181 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 11. Performance-Monitoring Counters (Continued) Event Source Notes / Unit Mask (bits 15–8) Event Description Number Unit 1xxx_xxxxb = reserved x1xx_xxxxb = WB xx1x_xxxxb = WP System requests with the selected type xxx1_xxxxb = WT bits 11–10 = reserved...
Page 182 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 11. Performance-Monitoring Counters (Continued) Event Source Notes / Unit Mask (bits 15–8) Event Description Number Unit Cycles that at least one fill request waited to use the L2 Instruction cache fetches...
Page 183 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 11. Performance-Monitoring Counters (Continued) Event Source Notes / Unit Mask (bits 15–8) Event Description Number Unit ICU full Reservation stations full FPU full LS full All quiet stall Far transfer or resync branch pending...
Page 184: Counters
allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are useful for generating an interrupt after a specific number of events.
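Using a negative initial value to trigger an overflow after a chosen number of events can be sketched as follows. The C code only computes the value to write and what a 48-bit counter would hold; it does not execute WRMSR, which requires privileged code. Both function names are hypothetical.

```c
#include <stdint.h>

/* Initial value that makes the counter overflow after n events. */
int64_t perfctr_initial_value(uint32_t n)
{
    return -(int64_t)n; /* counting up from -n reaches zero after n events */
}

/* The low 48 bits of a value, as a 48-bit counter would represent it. */
uint64_t perfctr_low48(int64_t value)
{
    return (uint64_t)value & 0x0000FFFFFFFFFFFFull;
}
```

For example, an initial value of -1000 appears in a 48-bit counter as 0xFFFFFFFFFC18, which wraps to zero after 1000 increments.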
Page 185: Monitoring Counter Overflow
RDTSC and RDPMC instructions, which allow application code to read the counters directly. Monitoring Counter Overflow The AMD Athlon processor provides the option of generating a debug interrupt when a performance-monitoring counter overflows. This mechanism is enabled by setting the interrupt enable flag in one of the PerfEvtSel[3:0] MSRs.
Page 186 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 An event monitor application utility or another application program can read the collected performance information of the profiled application. Monitoring Counter Overflow...
Page 187: Appendix E Programming The Mtrr And Pat
Appendix E Programming the MTRR and PAT Introduction The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includes the Page Attribute Table (PAT) for defining attributes of pages. This chapter documents the use and capabilities of this feature.
Page 188 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4K, 16K or 64K segment within the first 1 Mbyte of memory, there is one fixed address MTRR.
Page 189: Figure 12. Mtrr Mapping Of Physical Memory
[Figure 12. MTRR Mapping of Physical Memory. From the bottom of memory upward: 8 fixed ranges of 64 Kbytes each (512 Kbytes, up to 80000h); 16 fixed ranges of 16 Kbytes each (256 Kbytes, 80000h to C0000h); 64 fixed ranges of 4 Kbytes each (256 Kbytes, C0000h to 100000h); 0-8 variable ranges above 100000h, up to FFFFFFFFh. The figure also shows the SMM TSeg region.]
Page 190: Figure 13. Mtrr Capability Register Format
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Memory Types Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in Table 12 on page 174.
Page 191: Figure 14. Mtrr Default Type Register Format
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 MTRR Default Type Register Format. The MTRR default type register is defined as follows. Type Reserved Symbol Description Bits MTRRs Enabled Fixed Range Enabled Type Default Memory Type 7–0 Figure 14. MTRR Default Type Register Format MTRRs are enabled when set.
Page 192: Table 13. Standard Mtrr Types And Properties
When a large page (2 Mbytes/4 Mbytes) mapping covers a region that contains more than one memory type (as mapped by the MTRRs), the AMD Athlon processor does not suppress the caching of that large page mapping and only caches the mapping for just that 4-Kbyte piece in the 4-Kbyte TLB.
Page 193: Page Attribute Table (Pat)
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 not affected by this issue, only the variable range (and MTRR DefType) registers are affected. Page Attribute Table (PAT) The Page Attribute Table (PAT) is an extension of the page table entry format, which allows the specification of memory types to regions of physical memory based on the linear address.
Page 194: Table 14. Pati 3-Bit Encodings
Accessing the PAT A 3-bit index consisting of the PATi, PCD, and PWT bits of the page table entry is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is defined as bit 7 for 4-Kbyte PTEs and bit 12 for PDEs which map to 2-Mbyte or 4-Mbyte pages).
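Forming the 3-bit PAT index for a 4-Kbyte page table entry can be sketched in C. The PATi position (bit 7) comes from the text above; the PWT and PCD positions (bits 3 and 4) are the standard x86 page table entry layout and are stated here as an assumption because this excerpt does not list them. The function name `pat_index_4k` is hypothetical.

```c
#include <stdint.h>

/* Build the PAT index {PATi, PCD, PWT} from a 4-Kbyte page table entry. */
unsigned pat_index_4k(uint32_t pte)
{
    unsigned pwt  = (pte >> 3) & 1u; /* page write-through (assumed bit 3) */
    unsigned pcd  = (pte >> 4) & 1u; /* page cache disable (assumed bit 4) */
    unsigned pati = (pte >> 7) & 1u; /* PATi, bit 7 for 4-Kbyte PTEs */

    /* PATi is the most significant index bit, PWT the least. */
    return (pati << 2) | (pcd << 1) | pwt;
}
```

For entries that map 2-Mbyte or 4-Mbyte pages, the same construction would read PATi from bit 12 instead of bit 7.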
Page 195: Table 15. Effective Memory Type Based On Pat And
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 15. Effective Memory Type Based on PAT and MTRRs PAT Memory Type MTRR Memory Type Effective Memory Type WB, WT, WP, WC UC-Page UC-MTRR WB, WT WB, WP UC-MTRR WC, WT Notes: 1.
Page 196: Table 16. Final Output Memory Types
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 16. Final Output Memory Types Input Memory Type Output Memory Type AMD-751 Note 1, 2 Page Attribute Table (PAT)
Page 197 AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Table 16. Final Output Memory Types (Continued) Input Memory Type Output Memory Type AMD-751 Note Notes: 1. WP is not functional for RdMem/WrMem. 2. ForceCD must cause the MTRR memory type to be ignored in order to avoid x’s.
Page 198: Table 17. Mtrr Fixed Range Register Format
MTRR Fixed-Range Register Format The memory types defined for memory segments defined in each of the MTRR fixed-range registers are defined in Table 17 (also see “Standard MTRR Types and Properties” on page 176).
Page 199: Figure 16. Mtrrphysbasen Register Format
The variable address range is power of 2 sized and aligned. The range of supported sizes is from 2^12 to 2^36 in powers of 2. The AMD Athlon processor does not implement A[35:32]. Type Physical Base Reserved Symbol...
Page 200: Figure 17. Mtrrphysmaskn Register Format
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Physical Mask Reserved Symbol Description Bits Physical Mask 24-Bit Mask 35–12 Variable Range Register Pair Enabled 11 (V = 0 at reset) Figure 17. MTRRphysMaskn Register Format Note: A software attempt to write to reserved bits will generate a general protection exception.
Page 201: Table 18. Mtrr-Related Model-Specific Register
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 MTRR MSR Format This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits. Table 18. MTRR-Related Model-Specific Register (MSR) Map...
Page 203: Appendix F Instruction Dispatch And Execution Resources
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Appendix F Instruction Dispatch and Execution Resources This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations. Tables 19 through 24 starting on page 188 define the integer, MMX™, MMX...
Page 204: Table 19. Integer Instructions
DirectPath or VectorPath (see “DirectPath Decoder” on page 133 and “VectorPath Decoder” on page 133 for more information). The AMD Athlon™ processor enhanced decode logic can process three instructions per clock.
Page 205 Table 19. Integer Instructions (Continued). Columns: Instruction Mnemonic, First Byte, Second Byte, ModR/M Byte, Decode Type. ADC mreg8, reg8 11-xxx-xxx DirectPath; ADC mem8, reg8 mm-xxx-xxx DirectPath; ADC mreg16/32, reg16/32 11-xxx-xxx DirectPath; ADC mem16/32, reg16/32...
Page 206 Table 19. Integer Instructions (Continued). AND mem8, reg8 mm-xxx-xxx DirectPath; AND mreg16/32, reg16/32 11-xxx-xxx DirectPath; AND mem16/32, reg16/32 mm-xxx-xxx DirectPath; AND reg8, mreg8...
Page 207 Table 19. Integer Instructions (Continued). BT mem16/32, imm8 mm-100-xxx DirectPath; BTC mreg16/32, reg16/32 11-xxx-xxx VectorPath; BTC mem16/32, reg16/32 mm-xxx-xxx VectorPath; BTC mreg16/32, imm8...
Page 208 Table 19. Integer Instructions (Continued). CMOVE/CMOVZ reg16/32, reg16/32 11-xxx-xxx DirectPath; CMOVE/CMOVZ reg16/32, mem16/32 mm-xxx-xxx DirectPath; CMOVG/CMOVNLE reg16/32, reg16/32 11-xxx-xxx DirectPath; CMOVG/CMOVNLE reg16/32, mem16/32...
Page 209 Table 19. Integer Instructions (Continued). CMP EAX, imm16/32 DirectPath; CMP mreg8, imm8 11-111-xxx DirectPath; CMP mem8, imm8 mm-111-xxx DirectPath; CMP mreg16/32, imm16/32...
Page 210 Table 19. Integer Instructions (Continued). DIV EAX, mreg16/32 11-110-xxx VectorPath; DIV EAX, mem16/32 mm-110-xxx VectorPath; ENTER VectorPath; IDIV mreg8 11-111-xxx VectorPath; IDIV mem8...
Page 211 Table 19. Integer Instructions (Continued). INC mreg8 11-000-xxx DirectPath; INC mem8 mm-000-xxx DirectPath; INC mreg16/32 11-000-xxx DirectPath; INC mem16/32 mm-000-xxx DirectPath; INVD...
Page 212 Table 19. Integer Instructions (Continued). JP/JPE near disp16/32 DirectPath; JNP/JPO near disp16/32 DirectPath; JL/JNGE near disp16/32 DirectPath; JNL/JGE near disp16/32 DirectPath; JLE/JNG near disp16/32...
Page 213 Table 19. Integer Instructions (Continued). LOOPE/LOOPZ disp8 VectorPath; LOOPNE/LOOPNZ disp8 VectorPath; LSL reg16/32, mreg16/32 11-xxx-xxx VectorPath; LSL reg16/32, mem16/32 mm-xxx-xxx VectorPath; LSS reg16/32, mem32/48...
Page 214 Table 19. Integer Instructions (Continued). MOV EDX, imm16/32 DirectPath; MOV EBX, imm16/32 DirectPath; MOV ESP, imm16/32 DirectPath; MOV EBP, imm16/32 DirectPath; MOV ESI, imm16/32...
Page 215 Table 19. Integer Instructions (Continued). NOT mem8 mm-010-xx DirectPath; NOT mreg16/32 11-010-xxx DirectPath; NOT mem16/32 mm-010-xx DirectPath; OR mreg8, reg8 11-xxx-xxx DirectPath...
Page 216 Table 19. Integer Instructions (Continued). POP EBX VectorPath; POP ESP VectorPath; POP EBP VectorPath; POP ESI VectorPath; POP EDI VectorPath; POP mreg16/32...
Page 217 Table 19. Integer Instructions (Continued). RCL mreg8, 1 11-010-xxx DirectPath; RCL mem8, 1 mm-010-xxx DirectPath; RCL mreg16/32, 1 11-010-xxx DirectPath; RCL mem16/32, 1...
Page 218 Table 19. Integer Instructions (Continued). ROL mreg16/32, 1 11-000-xxx DirectPath; ROL mem16/32, 1 mm-000-xxx DirectPath; ROL mreg8, CL 11-000-xxx DirectPath; ROL mem8, CL...
Page 219 Table 19. Integer Instructions (Continued). SBB mreg16/32, reg16/32 11-xxx-xxx DirectPath; SBB mem16/32, reg16/32 mm-xxx-xxx DirectPath; SBB reg8, mreg8 11-xxx-xxx DirectPath; SBB reg8, mem8...
Page 220 Table 19. Integer Instructions (Continued). SETS mreg8 11-xxx-xxx DirectPath; SETS mem8 mm-xxx-xxx DirectPath; SETNS mreg8 11-xxx-xxx DirectPath; SETNS mem8 mm-xxx-xxx DirectPath; SETP/SETPE mreg8...
Page 221 Table 19. Integer Instructions (Continued). SHR mem16/32, imm8 mm-101-xxx DirectPath; SHR mreg8, 1 11-101-xxx DirectPath; SHR mem8, 1 mm-101-xxx DirectPath; SHR mreg16/32, 1...
Page 222 Table 19. Integer Instructions (Continued). SUB reg8, mreg8 11-xxx-xxx DirectPath; SUB reg8, mem8 mm-xxx-xxx DirectPath; SUB reg16/32, mreg16/32 11-xxx-xxx DirectPath; SUB reg16/32, mem16/32...
Page 223 Table 19. Integer Instructions (Continued). XADD mreg8, reg8 11-100-xxx VectorPath; XADD mem8, reg8 mm-100-xxx VectorPath; XADD mreg16/32, reg16/32 11-101-xxx VectorPath; XADD mem16/32, reg16/32...
Page 224: Table 20. MMX™ Instructions
Table 20. MMX™ Instructions. Columns: Instruction Mnemonic, Prefix Byte(s), First Byte, ModR/M Byte, Decode Type, FPU Pipe(s), Notes. EMMS DirectPath FADD/FMUL/FSTORE; MOVD mmreg, reg32 11-xxx-xxx VectorPath; MOVD mmreg, mem32 mm-xxx-xxx DirectPath FADD/FMUL/FSTORE; MOVD reg32, mmreg...
Page 225 Table 20. MMX™ Instructions (Continued). PANDN mmreg1, mmreg2 11-xxx-xxx DirectPath FADD/FMUL; PANDN mmreg, mem64 mm-xxx-xxx DirectPath FADD/FMUL; PCMPEQB mmreg1, mmreg2...
Page 226 Table 20. MMX™ Instructions (Continued). PSRAW mmreg1, mmreg2 11-xxx-xxx DirectPath FADD/FMUL; PSRAW mmreg, mem64 mm-xxx-xxx DirectPath FADD/FMUL; PSRAW mmreg, imm8...
Page 227: Table 21. MMX™ Extensions
Table 20. MMX™ Instructions (Continued). PUNPCKHDQ mmreg1, mmreg2 11-xxx-xxx DirectPath FADD/FMUL; PUNPCKHDQ mmreg, mem64 mm-xxx-xxx DirectPath FADD/FMUL; PUNPCKHWD mmreg1, mmreg2...
Page 228: Table 22. Floating-Point Instructions
Table 21. MMX™ Extensions (Continued). Columns: Instruction Mnemonic, Prefix Byte(s), First Byte, ModR/M Byte, Decode Type, Pipe(s), Notes. PMINSW mmreg, mem64 EAh mm-xxx-xxx DirectPath FADD/FMUL; PMINUB mmreg1, mmreg2 11-xxx-xxx DirectPath FADD/FMUL; PMINUB mmreg, mem64...
Page 229 Table 22. Floating-Point Instructions (Continued). Columns: Instruction Mnemonic, First Byte, Second Byte, ModR/M Byte, Decode Type, Pipe(s), Note. FCMOVB ST(0), ST(i) DAh C0-C7h VectorPath; FCMOVE ST(0), ST(i) DAh C8-CFh VectorPath; FCMOVBE ST(0), ST(i)
Page 230 Table 22. Floating-Point Instructions (Continued). FIADD [mem32int] mm-000-xxx VectorPath; FIADD [mem16int] mm-000-xxx VectorPath; FICOM [mem32int] mm-010-xxx VectorPath; FICOM [mem16int] mm-010-xxx VectorPath...
Page 231 Table 22. Floating-Point Instructions (Continued). FLDCW [mem16] mm-101-xxx VectorPath; FLDENV [mem14byte] mm-100-xxx VectorPath; FLDENV [mem28byte] mm-100-xxx VectorPath; FLDL2E DirectPath FSTORE...
Page 232 Table 22. Floating-Point Instructions (Continued). FSTCW [mem16] mm-111-xxx VectorPath; FSTENV [mem14byte] mm-110-xxx VectorPath; FSTENV [mem28byte] mm-110-xxx VectorPath; FSTP [mem32real] mm-011-xxx DirectPath...
Page 233: Table 23. 3DNow!™ Instructions
Table 23. 3DNow!™ Instructions. Columns: Instruction Mnemonic, Prefix Byte(s), imm8, ModR/M Byte, Decode Type, Pipe(s), Note. FEMMS DirectPath FADD/FMUL/FSTORE; PAVGUSB mmreg1, mmreg2 0Fh, 0Fh 11-xxx-xxx DirectPath FADD/FMUL; PAVGUSB mmreg, mem64 0Fh, 0Fh mm-xxx-xxx DirectPath...
Page 234: Table 24. 3DNow!™ Extensions
Table 23. 3DNow!™ Instructions (Continued). PFRSQRT mmreg, mem64 0Fh, 0Fh mm-xxx-xxx DirectPath FMUL; PFSUB mmreg1, mmreg2 0Fh, 0Fh 11-xxx-xxx DirectPath FADD...
Page 235: Appendix G DirectPath Versus VectorPath Instructions
DirectPath Instructions. The following tables contain DirectPath instructions, which should be used in the AMD Athlon processor wherever possible: Table 25, “DirectPath Integer Instructions,” on page 220; Table 26, “DirectPath MMX™ Instructions,” on page 227, and Table 27, “DirectPath MMX™ Extensions,” on page 228; Table 28, “DirectPath Floating-Point Instructions,”...
Page 236 Table 25. DirectPath Integer Instructions. Instruction Mnemonic: ADC mreg8, reg8; ADC mem8, reg8; ADC mreg16/32, reg16/32 ... AND mreg16/32, reg16/32; AND mem16/32, reg16/32; AND reg8, mreg8...
Page 237 Table 25. DirectPath Integer Instructions (Continued). CMOVBE/CMOVNA reg16/32, reg16/32; CMOVBE/CMOVNA reg16/32, mem16/32; CMOVE/CMOVZ reg16/32, reg16/32 ... CMP AL, imm8; CMP EAX, imm16/32; CMP mreg8, imm8...
Page 238 Table 25. DirectPath Integer Instructions (Continued). JNO short disp8; JB/JNAE short disp8; JNB/JAE short disp8 ... JMP near mreg16/32 (indirect); JMP near mem16/32 (indirect)...
Page 239 Table 25. DirectPath Integer Instructions (Continued). MOV mem16/32, imm16/32; MOVSX reg16/32, mreg8; MOVSX reg16/32, mem8; MOVSX reg32, mreg16 ... PUSH EAX; PUSH ECX; PUSH EDX...
Page 240 Table 25. DirectPath Integer Instructions (Continued). ROL mreg8, CL; ROL mem8, CL; ROL mreg16/32, CL ... SBB reg16/32, mreg16/32; SBB reg16/32, mem16/32; SBB AL, imm8...
Page 241 Table 25. DirectPath Integer Instructions (Continued). SETL/SETNGE mreg8; SETL/SETNGE mem8; SETGE/SETNL mreg8 ... SUB mem8, reg8; SUB mreg16/32, reg16/32; SUB mem16/32, reg16/32; SUB reg8, mreg8...
Page 242 Table 25. DirectPath Integer Instructions (Continued). XOR reg16/32, mem16/32; XOR AL, imm8; XOR EAX, imm16/32; XOR mreg8, imm8; XOR mem8, imm8; XOR mreg16/32, imm16/32; XOR mem16/32, imm16/32; XOR mreg16/32, imm8 (sign extended)
Page 243 Table 26. DirectPath MMX™ Instructions. Instruction Mnemonic: EMMS; MOVD mmreg, mem32; MOVD mem32, mmreg; MOVQ mmreg1, mmreg2 ... PCMPEQD mmreg, mem64; PCMPEQW mmreg1, mmreg2; PCMPEQW mmreg, mem64...
Page 244: Table 27. DirectPath MMX™ Extensions
Table 26. DirectPath MMX™ Instructions (Continued). PSRLD mmreg, imm8; PSRLQ mmreg1, mmreg2; PSRLQ mmreg, mem64 ... PXOR mmreg, mem64. Table 27. DirectPath MMX™ Extensions...
Page 245: Table 28. DirectPath Floating-Point Instructions
Table 28. DirectPath Floating-Point Instructions. Instruction Mnemonic: FABS; FADD ST, ST(i); FADD [mem32real]; FADD ST(i), ST; FADD [mem64real] ... FIST [mem32int]; FISTP [mem16int]; FISTP [mem32int]; FISTP [mem64int]...
Page 246 Table 28. DirectPath Floating-Point Instructions (Continued). FSUB ST(i), ST; FSUBP ST, ST(i); FSUBR [mem32real]; FSUBR [mem64real]; FSUBR ST, ST(i); FSUBR ST(i), ST; FSUBRP ST(i), ST; FTST; FUCOM; FUCOMP; FUCOMPP; FWAIT; FXCH...
Page 247: VectorPath Instructions
VectorPath Instructions. The following tables contain VectorPath instructions, which should be avoided in the AMD Athlon processor: Table 29, “VectorPath Integer Instructions,” on page 231; Table 30, “VectorPath MMX™ Instructions,” on page 234, and Table 31, “VectorPath MMX™ Extensions,” on page 234; Table 32, “VectorPath Floating-Point Instructions,”...
Page 248 Table 29. VectorPath Integer Instructions (Continued). DIV EAX, mem16/32; ENTER; IDIV mreg8; IDIV mem8 ... LEA reg16, mem16/32; LEAVE; LES reg16/32, mem32/48; LFS reg16/32, mem32/48...
Page 249 Table 29. VectorPath Integer Instructions (Continued). MUL EAX, mem32; OUT imm8, AL; OUT imm8, AX ... RCL mem8, imm8; RCL mem16/32, imm8; RCL mem8, CL...
Page 250: Table 31. VectorPath MMX™ Extensions
Table 29. VectorPath Integer Instructions (Continued). STOSB mem8, AL; STOSW mem16, AX. Table 30. VectorPath MMX™ Instructions. MOVD mmreg, mreg32; MOVD mreg32, mmreg. Table 31. VectorPath MMX™ Extensions...
Page 251 Table 32. VectorPath Floating-Point Instructions. Instruction Mnemonic: F2XM1; FBLD [mem80]; FBSTP [mem80]; FCLEX; FCMOVB ST(0), ST(i); FCMOVE ST(0), ST(i) ... FLDENV [mem14byte]; FLDENV [mem28byte]; FPTAN; FPATAN; FRNDINT
Page 252 VectorPath Instructions...
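The DirectPath and VectorPath listings in Appendix G lend themselves to a mechanical check of generated code. The following C sketch is one hedged way to do that: it copies a handful of rows from Tables 19, 25, 26, 29, and 30 into a lookup table keyed by the full instruction form, because the decode type depends on the operand form and not on the mnemonic alone (the two MOVD rows show this). The type and function names are hypothetical, and the table is a small excerpt, not the complete listing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    DECODE_UNKNOWN,      /* form is not in this excerpt */
    DECODE_DIRECTPATH,
    DECODE_VECTORPATH
} decode_type;

struct decode_entry {
    const char *form;    /* instruction form exactly as printed in the tables */
    decode_type type;
};

/* Excerpt only: rows copied from the appendix tables for illustration. */
static const struct decode_entry decode_table[] = {
    { "INC mreg8",          DECODE_DIRECTPATH },
    { "PUSH EAX",           DECODE_DIRECTPATH },
    { "MOVD mmreg, mem32",  DECODE_DIRECTPATH },
    { "MOVD mmreg, mreg32", DECODE_VECTORPATH },
    { "ENTER",              DECODE_VECTORPATH },
    { "LEAVE",              DECODE_VECTORPATH },
    { "POP EBX",            DECODE_VECTORPATH },
    { "LOOPE/LOOPZ disp8",  DECODE_VECTORPATH },
};

/* Looks up one instruction form; returns DECODE_UNKNOWN if it is absent. */
static decode_type lookup_decode_type(const char *form)
{
    size_t i;
    for (i = 0; i < sizeof decode_table / sizeof decode_table[0]; i++) {
        if (strcmp(decode_table[i].form, form) == 0)
            return decode_table[i].type;
    }
    return DECODE_UNKNOWN;
}

/* Counts how many forms in a sequence are listed as VectorPath. */
static size_t count_vectorpath(const char *const forms[], size_t n)
{
    size_t i, hits = 0;
    assert(forms != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (lookup_decode_type(forms[i]) == DECODE_VECTORPATH)
            hits++;
    }
    return hits;
}
```

Given the sequence PUSH EAX, ENTER, POP EBX, `count_vectorpath` reports two VectorPath forms, which flags ENTER and POP EBX as candidates for replacement. A real tool would need the full tables and an operand-form matcher; both are outside this sketch.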
Page 253: Index
Far Control Transfer Instructions, 65; AMD Athlon™ System Bus, 139; Fetch and Decode Pipeline Stages...
Page 254 Instruction Cache, 131; MOVZX and MOVSX Instructions, 73; MSR Access...
Page 255 TBYTE Variables, 55...
Page 256 Index...

AMD Athlon Processor x86 Optimization Manual

1 Introduction

2 Top Optimizations

3 C Source Level Optimizations

4 Instruction Decoding Optimizations

5 Cache and Memory Optimizations

6 Branch Optimizations

7 Scheduling Optimizations

8 Integer Optimizations

9 Floating-Point Optimizations

10 3DNow!™ and MMX™ Optimizations

11 General x86 Optimization Guidelines

Appendix A AMD Athlon™ Processor Microarchitecture

Appendix B Pipeline and Execution Unit Resources Overview

Appendix C Implementation of Write Combining

Appendix D Performance-Monitoring Counters

Appendix E Programming the MTRR and PAT

Appendix F Instruction Dispatch and Execution Resources

Appendix G DirectPath Versus VectorPath Instructions

Index
