AMD Athlon Processor x86 Manuals
Manuals and User Guides for AMD Athlon Processor x86. We have 1 AMD Athlon Processor x86 manual available for free PDF download: Optimization Manual
AMD Athlon Processor x86 Optimization Manual (256 pages)
x86 Code Optimization
Brand: AMD | Category: Computer Hardware | Size: 3 MB
Table of Contents
Table of Contents
3
Revision History
15
1 Introduction
17
About this Document
17
AMD Athlon™ Processor Family
19
AMD Athlon Processor Microarchitecture Summary
20
2 Top Optimizations
23
Optimization Star
24
Group I Optimizations - Essential Optimizations
24
Memory Size and Alignment Issues
24
Instructions
24
Select DirectPath over VectorPath Instructions
25
Group II Optimizations - Secondary Optimizations
25
Load-Execute Instruction Usage
25
Take Advantage of Write Combining
26
Use 3DNow! Instructions
26
Avoid Branches Dependent on Random Data
26
Avoid Placing Code and Data in the same 64-Byte Cache Line
27
3 C Source Level Optimizations
29
Ensure Floating-Point Variables and Expressions Are of Type Float
29
Use 32-Bit Data Types for Integer Code
29
Consider the Sign of Integer Operands
30
Use Array Style Instead of Pointer Style Code
31
Completely Unroll Small Loops
34
Avoid Unnecessary Store-To-Load Dependencies
34
Consider Expression Order in Compound Branch Conditions
36
Switch Statement Usage
37
Optimize Switch Statements
37
Use Prototypes for All Functions
37
Use Const Type Qualifier
38
Generic Loop Hoisting
38
Generalization for Multiple Constant Control Code
39
Declare Local Functions as Static
40
Dynamic Memory Allocation Consideration
41
Introduce Explicit Parallelism into Code
41
Explicitly Extract Common Subexpressions
42
C Language Structure Component Considerations
43
Sort Local Variables According to Base Type Size
44
Accelerating Floating-Point Divides and Square Roots
45
Avoid Unnecessary Integer Division
47
Copy Frequently De-Referenced Pointer Arguments to Local Variables
47
4 Instruction Decoding Optimizations
49
Overview
49
Select DirectPath over VectorPath Instructions
50
Load-Execute Instruction Usage
50
Use Load-Execute Integer Instructions
50
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
51
Avoid Load-Execute Floating-Point Instructions with Integer Operands
51
Align Branch Targets in Program Hot Spots
52
Use Short Instruction Lengths
52
Avoid Partial Register Reads and Writes
53
Replace Certain SHLD Instructions with Alternative Code
54
Use 8-Bit Sign-Extended Immediates
54
Use 8-Bit Sign-Extended Displacements
55
Code Padding Using Neutral Code Fillers
55
Recommendations for the AMD Athlon Processor
56
Recommendations for AMD-K6® Family and AMD Athlon Processor Blended Code
57
5 Cache and Memory Optimizations
61
Memory Size and Alignment Issues
61
Avoid Memory Size Mismatches
61
Align Data Where Possible
62
Use the 3DNow! PREFETCH and PREFETCHW Instructions
62
Take Advantage of Write Combining
66
Avoid Placing Code and Data in the same 64-Byte Cache Line
66
Store-To-Load Forwarding Restrictions
67
Store-To-Load Forwarding Pitfalls - True Dependencies
67
Summary of Store-To-Load Forwarding Pitfalls to Avoid
70
Stack Alignment Considerations
70
Align TBYTE Variables on Quadword Aligned Addresses
71
C Language Structure Component Considerations
71
Sort Variables According to Base Type Size
72
6 Branch Optimizations
73
Avoid Branches Dependent on Random Data
73
AMD Athlon Processor Specific Code
74
Blended AMD-K6 and AMD Athlon Processor Code
74
Always Pair CALL and RETURN
75
Replace Branches with Computation in 3DNow! Code
76
Muxing Constructs
76
Sample Code Translated into 3DNow! Code
77
Avoid the LOOP Instruction
81
Avoid Far Control Transfer Instructions
81
Avoid Recursive Functions
82
7 Scheduling Optimizations
83
Schedule Instructions According to Their Latency
83
Unrolling Loops
83
Complete Loop Unrolling
83
Partial Loop Unrolling
84
Use Function Inlining
87
Overview
87
Always Inline Functions if Called from One Site
88
Instructions
88
Avoid Address Generation Interlocks
88
Use MOVZX and MOVSX
89
Minimize Pointer Arithmetic in Loops
89
Push Memory Data Carefully
91
8 Integer Optimizations
93
Replace Divides with Multiplies
93
Multiplication by Reciprocal (Division) Utility
93
Unsigned Division by Multiplication of Constant
94
Signed Division by Multiplication of Constant
95
Use Alternative Code When Multiplying by a Constant
97
Use MMX™ Instructions for Integer-Only Work
99
Repeated String Instruction Usage
100
Latency of Repeated String Instructions
100
Guidelines for Repeated String Instructions
100
Table 1. Latency of Repeated String Instructions
100
Use XOR Instruction to Clear Integer Registers
102
Efficient 64-Bit Integer Arithmetic
102
Efficient Implementation of Population Count Function
107
By Constants
109
Shift Factor
109
Shift Factor
111
9 Floating-Point Optimizations
113
Ensure All FPU Data Is Aligned
113
Use Multiplies Rather than Divides
113
Use FFREEP Macro to Pop One Register from the FPU Stack
114
Floating-Point Compare Instructions
114
Use the FXCH Instruction Rather than FST/FLD Pairs
115
Avoid Using Extended-Precision Data
115
Minimize Floating-Point-To-Integer Conversions
116
Floating-Point Subexpression Elimination
119
Check Argument Range of Trigonometric Instructions Efficiently
119
Take Advantage of the FSINCOS Instruction
121
10 3DNow!™ and MMX™ Optimizations
123
Use 3DNow! Instructions
123
Use FEMMS Instruction
123
Use 3DNow! Instructions for Fast Division
124
Optimized 14-Bit Precision Divide
124
Optimized Full 24-Bit Precision Divide
124
Pipelined Pair of 24-Bit Precision Divides
125
Newton-Raphson Reciprocal
125
Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root
126
Optimized 15-Bit Precision Square Root
126
Optimized 24-Bit Precision Square Root
126
Newton-Raphson Reciprocal Square Root
127
Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
127
3DNow! and MMX Intra-Operand Swapping
128
Fast Conversion of Signed Words to Floating-Point
129
Use MMX PXOR to Negate 3DNow! Data
129
Use MMX PCMP Instead of 3DNow! PFCMP
130
Use MMX Instructions for Block Copies and Block Fills
131
Use MMX PXOR to Clear All Bits in an MMX Register
134
Use MMX PCMPEQD to Set All Bits in an MMX Register
135
Use MMX PAND to Find Absolute Value in 3DNow! Code
135
Optimized Matrix Multiplication
135
Efficient 3D-Clipping Code Computation Using 3DNow! Instructions
138
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
139
Stream of Packed Unsigned Bytes
141
Complex Number Arithmetic
142
11 General x86 Optimization Guidelines
143
Short Forms
143
Dependencies
144
Register Operands
144
Stack Allocation
144
Appendix A AMD Athlon™ Processor Microarchitecture
145
Introduction
145
AMD Athlon Processor Microarchitecture
146
Superscalar Processor
146
Instruction Cache
147
Figure 1. AMD Athlon™ Processor Block Diagram
147
Predecode
148
Branch Prediction
148
Early Decoding
149
Instruction Control Unit
150
Data Cache
150
Integer Scheduler
151
Integer Execution Unit
151
Figure 2. Integer Execution Pipeline
151
Floating-Point Scheduler
152
Floating-Point Execution Unit
153
Figure 3. Floating-Point Unit Block Diagram
153
Load-Store Unit (LSU)
154
Figure 4. Load/Store Unit
154
L2 Cache Controller
155
Write Combining
155
AMD Athlon System Bus
155
Appendix B Pipeline and Execution Unit Resources Overview
157
Fetch and Decode Pipeline Stages
157
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
158
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages
158
Integer Pipeline Stages
160
Figure 7. Integer Execution Pipeline
160
Figure 8. Integer Pipeline Stages
160
Floating-Point Pipeline Stages
162
Figure 9. Floating-Point Unit Block Diagram
162
Figure 10. Floating-Point Pipeline Stages
162
Execution Unit Resources
164
Terminology
164
Integer Pipeline Operations
165
Table 2. Integer Pipeline Operation Types
165
Table 3. Integer Decode Types
165
Floating-Point Pipeline Operations
166
Table 4. Floating-Point Pipeline Operation Types
166
Table 5. Floating-Point Decode Types
166
Load/Store Pipeline Operations
167
Table 6. Load/Store Unit Stages
167
Code Sample Analysis
168
Table 7. Sample 1 - Integer Register Operations
169
Table 8. Sample 2 - Integer Register and Memory Load
170
Appendix C Implementation of Write Combining
171
Introduction
171
Write-Combining Definitions and Abbreviations
172
What Is Write Combining
172
Programming Details
172
Write-Combining Operations
173
Table 9. Write Combining Completion Events
174
Sending Write-Buffer Data to the System
175
Table 10. AMD Athlon™ System Bus Commands
175
Appendix D Performance-Monitoring Counters
177
Overview
177
Performance Counter Usage
177
PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h-C001_0003h)
178
Figure 11. PerfEvtSel[3:0] Registers
178
Table 11. Performance-Monitoring Counters
180
Counters
184
Event and Time-Stamp Monitoring Software
184
Monitoring Counter Overflow
185
Appendix E Programming the MTRR and PAT
187
Introduction
187
Memory Type Range Register (MTRR) Mechanism
187
Figure 12. MTRR Mapping of Physical Memory
189
Figure 13. MTRR Capability Register Format
190
Table 12. Memory Type Encodings
190
Figure 14. MTRR Default Type Register Format
191
Table 13. Standard MTRR Types and Properties
192
Page Attribute Table (PAT)
193
Figure 15. Page Attribute Table (MSR 277H)
193
Table 14. PATi 3-Bit Encodings
194
Table 15. Effective Memory Type Based on PAT and
195
Table 16. Final Output Memory Types
196
Table 17. MTRR Fixed Range Register Format
198
Figure 16. MTRRphysBasen Register Format
199
Figure 17. MTRRphysMaskn Register Format
200
Table 18. MTRR-Related Model-Specific Register
201
Appendix F Instruction Dispatch and Execution Resources
203
Table 19. Integer Instructions
204
Table 20. MMX™ Instructions
224
Table 21. MMX Extensions
227
Table 22. Floating-Point Instructions
228
Table 23. 3DNow!™ Instructions
233
Table 24. 3DNow! Extensions
234
Appendix G DirectPath Versus VectorPath Instructions
235
Select DirectPath over VectorPath Instructions
235
DirectPath Instructions
235
Table 25. DirectPath Integer Instructions
235
Table 26. DirectPath MMX Instructions
235
Table 27. DirectPath MMX Extensions
244
Table 28. DirectPath Floating-Point Instructions
245
VectorPath Instructions
247
Table 29. VectorPath Integer Instructions
247
Table 30. VectorPath MMX Instructions
247
Table 32. VectorPath Floating-Point Instructions
247
Table 31. VectorPath MMX Extensions
250
Index
253