IA-32 Intel® Architecture
Optimization Reference
Manual
Order Number: 248966-013US
April 2006

Summary of Contents for the IA-32 Intel® Architecture Optimization Reference Manual

  • Page 1 IA-32 Intel® Architecture Optimization Reference Manual Order Number: 248966-013US April 2006...
  • Page 2 Intel may make changes to specifications and product descriptions at any time, without notice. This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation.
  • Page 3: Table Of Contents

    Extended Memory 64 Technology (Intel® EM64T)... Intel NetBurst Microarchitecture... 1-8 Design Goals of Intel NetBurst Microarchitecture ... 1-8 Overview of the Intel NetBurst Microarchitecture Pipeline ... 1-9 The Front End... 1-11 The Out-of-order Core ... 1-12 Retirement ... 1-12 Front End Pipeline Detail... 1-13 Prefetching...
  • Page 4 Execution Core ... 1-39 Retirement ... 1-39 Multi-Core Processors... 1-39 Microarchitecture Pipeline and Multi-Core Processors... 1-42 Shared Cache in Intel Core Duo Processors ... 1-42 Load and Store Operations... 1-42 Chapter 2 General Optimization Guidelines Tuning to Achieve Optimum Performance ... 2-1 Tuning to Prevent Known Coding Pitfalls ...
  • Page 5 Floating-point Exceptions ... 2-60 Floating-point Modes ... 2-62 Improving Parallelism and the Use of FXCH ... 2-68 x87 vs. Scalar SIMD Floating-point Trade-offs ... 2-69 Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors... 2-70 Memory Operands... 2-71 ®...
  • Page 6 Floating-Point Stalls... 2-72 x87 Floating-point Operations with Integer Operands ... 2-72 x87 Floating-point Comparison Instructions ... 2-72 Transcendental Functions ... 2-72 Instruction Selection... 2-73 Complex Instructions ... 2-74 Use of the lea Instruction... 2-74 Use of the inc and dec Instructions ... 2-75 Use of the shift and rotate Instructions ...
  • Page 7 Considerations for Code Conversion to SIMD Programming... 3-8 Identifying Hot Spots ... 3-10 Determine If Code Benefits by Conversion to SIMD Execution... 3-11 Coding Techniques ... 3-12 Coding Methodologies... 3-13 Assembly ... 3-15 Intrinsics... 3-15 Classes ... 3-17 Automatic Vectorization ... 3-18 Stack and Data Alignment...
  • Page 8 Packed Shuffle Word for 64-bit Registers ... 4-18 Packed Shuffle Word for 128-bit Registers ... 4-19 Unpacking/interleaving 64-bit Data in 128-bit Registers... 4-20 Data Movement ... 4-21 Conversion Instructions ... 4-21 Generating Constants ... 4-21 Building Blocks... 4-23 Absolute Difference of Unsigned Numbers ... 4-23 Absolute Difference of Signed Numbers ...
  • Page 9 Data Alignment... 5-4 Data Arrangement ... 5-4 Vertical versus Horizontal Computation... 5-5 Data Swizzling ... 5-9 Data Deswizzling ... 5-14 Using MMX Technology Code for Copy or Shuffling Functions ... 5-17 Horizontal ADD Using SSE... 5-18 Use of cvttps2pi/cvttss2si Instructions ... 5-21 Flush-to-Zero and Denormals-are-Zero Modes ...
  • Page 10 Hardware Prefetch ... 6-19 Example of Effective Latency Reduction with H/W Prefetch ... 6-20 Example of Latency Hiding with S/W Prefetch Instruction ... 6-22 Software Prefetching Usage Checklist ... 6-24 Software Prefetch Scheduling Distance ... 6-25 Software Prefetch Concatenation... 6-26 Minimize Number of Software Prefetches ...
  • Page 11 Key Practices of System Bus Optimization ... 7-17 Key Practices of Memory Optimization ... 7-17 Key Practices of Front-end Optimization ... 7-18 Key Practices of Execution Resource Optimization ... 7-18 Generality and Performance Impact... 7-19 Thread Synchronization ... 7-19 Choice of Synchronization Primitives ... 7-20 Synchronization for Short Periods ...
  • Page 12 Guidelines for Extending Battery Life... 9-7 Adjust Performance to Meet Quality of Features ... 9-8 Reducing Amount of Work... 9-9 Platform-Level Optimizations... 9-10 Handling Sleep State Transitions ... 9-11 Using Enhanced Intel SpeedStep ® Enabling Intel Enhanced Deeper Sleep ... 9-14 Multi-Core Considerations ... 9-15 Enhanced Intel SpeedStep Thread Migration Considerations...
  • Page 13 Counter Monitor... A-14 Intel® Tuning Assistant ... A-14 Intel® Performance Libraries... A-14 Benefits Summary ... A-15 Optimizations with the Intel® Enhanced Debugger (EDB) ... A-17 Intel® Threading Tools... A-17 Intel® Thread Checker... A-17 Thread Profiler... A-19 ...
  • Page 14 Using Performance Metrics with Hyper-Threading Technology ... B-50 Using Performance Events of Intel Core Solo and Intel Core Duo processors... B-56 Understanding the Results in a Performance Counter ... B-56 Ratio Interpretation ... B-57 Notes on Selected Events ... B-58 Appendix C IA-32 Instruction Latency and Throughput Overview ...
  • Page 15 Examples Example 2-1 Assembly Code with an Unpredictable Branch ... 2-17 Example 2-2 Code Optimization to Eliminate Branches ... 2-17 Example 2-3 Eliminating Branch with CMOV Instruction... 2-18 Example 2-4 Use of pause Instruction ... 2-19 Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm... 2-20 Example 2-6 Static Taken Prediction Example ...
  • Page 16 Example 3-4 Identification of SSE2 with cpuid ... 3-5 Example 3-5 Identification of SSE2 by the OS ... 3-6 Example 3-6 Identification of SSE3 with cpuid ... 3-7 Example 3-7 Identification of SSE3 by the OS ... 3-8 Example 3-8 Simple Four-Iteration Loop ...
  • Page 17 Example 4-20 Clipping to an Arbitrary Signed Range [high, low]... 4-27 Example 4-21 Simplified Clipping to an Arbitrary Signed Range ... 4-28 Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low]... 4-29 Example 4-23 Complex Multiply by a Constant ... 4-32 Example 4-24 A Large Load after a Series of Small Stores (Penalty)...
  • Page 18 Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation.. 6-50 Example 7-1 Serial Execution of Producer and Consumer Work Items ... 7-9 Example 7-2 Basic Structure of Implementing Producer Consumer Threads ... 7-11 Example 7-3 Thread Function for an Interlaced Producer Consumer Model ... 7-13 Example 7-4 Spin-wait Loop and PAUSE Instructions...
  • Page 19 The Intel NetBurst Microarchitecture ... 1-10 Figure 1-4 Execution Units and Ports in the Out-Of-Order Core... 1-19 Figure 1-5 The Intel Pentium M Processor Microarchitecture ... 1-27 Figure 1-6 Hyper-Threading Technology on an SMP... 1-35 Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor ...
  • Page 20 Sampling Analysis of Hotspots by Location...A-10 Figure A-2 Intel Thread Checker Can Locate Data Race Conditions...A-18 Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded Execution Timelines...A-20 Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus ...B-10 Figure D-1 Stack Frames Based on Alignment Type ...
  • Page 21 Tables Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters ... 1-20 Table 1-3 Cache Parameters of Pentium M, Intel ® Intel Core™ Duo Processors ... 1-30 Table 1-2 Trigger Threshold and CPUID Signatures for IA-32 Processor Families ... 1-30 Table 1-4 Family And Model Designations of Microarchitectures...
  • Page 22 Table C-5 Streaming SIMD Extension 64-bit Integer Instructions... C-14 Table C-7 IA-32 x87 Floating-point Instructions... C-16 Table C-8 IA-32 General Purpose Instructions ... C-17 xxii...
  • Page 23 The target audience for this manual includes software programmers and compiler writers. This manual assumes that the reader is familiar with the basics of the IA-32 architecture and has access to the Intel® Architecture Software Developer’s Manual: Volume 1, Basic Architecture;...
  • Page 24: About This Manual

    The Intel® VTune™ Performance Analyzer can help you analyze and locate hot-spot regions in your applications. On the Pentium 4, Intel® Xeon and Pentium M processors, this tool can monitor an application through a selection of performance monitoring events and analyze the performance event data that is gathered during code execution.
  • Page 25: Introduction

    Chapter 2: General Optimization Guidelines. Describes general code development and optimization techniques that apply to all applications designed to take advantage of the common features of the Intel NetBurst microarchitecture and Pentium M processor microarchitecture. Chapter 3: Coding for SIMD Architectures. Describes techniques and concepts for using the SIMD integer and SIMD floating-point instructions provided by the MMX™...
  • Page 26 Appendix A: Application Performance Tools. Introduces tools for analyzing and enhancing application performance without having to write assembly code. Appendix B: Intel Pentium 4 Processor Performance Metrics. Provides information that can be gathered using Pentium 4 processor’s performance monitoring events. These performance metrics can help programmers determine how effectively an application is using the features of the Intel NetBurst microarchitecture.
  • Page 27: Related Documentation

    Related Documentation For more information on the Intel architecture, specific techniques, and processor architecture terminology referenced in this manual, see the following documents: • Intel® C++ Compiler User’s Guide • Intel® Fortran Compiler User’s Guide • VTune Performance Analyzer online help •...
  • Page 28 Notational Conventions This manual uses the following conventions: This type style THIS TYPE STYLE This type style (ellipses) This type style xxviii Indicates an element of syntax, a reserved word, a keyword, a filename, instruction, computer output, or part of a program example.
  • Page 29 HT Technology and an HT Technology enabled chipset, BIOS and operating system. Performance varies depending on the hardware and software used. Dual-core platform requires an Intel Core Duo, Pentium D processor or Pentium processor Extreme Edition, with appropriate chipset, BIOS, and operating system. Performance varies depending on the hardware and software used.
  • Page 30: Simd Technology

    Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and power efficiency that are in addition to those introduced in the Pentium M processor. SIMD Technology SIMD computations (see Figure 1-1) were introduced in the IA-32 architecture with MMX technology.
  • Page 31: Figure 1-1 Typical Simd Operations

    each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as a set of four packed data elements.
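    The following is an illustrative C sketch (not from the manual) of the packed operation described above, using SSE intrinsics covered later in the manual; the function and variable names are hypothetical.

        #include <xmmintrin.h>   /* SSE intrinsics */

        /* Add four packed single-precision pairs (X1..X4 + Y1..Y4) with one SIMD instruction. */
        void add_packed4(const float *x, const float *y, float *result)
        {
            __m128 vx = _mm_loadu_ps(x);                  /* X4 | X3 | X2 | X1 */
            __m128 vy = _mm_loadu_ps(y);                  /* Y4 | Y3 | Y2 | Y1 */
            _mm_storeu_ps(result, _mm_add_ps(vx, vy));    /* four parallel additions */
        }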
  • Page 32: Figure 1-2 Simd Instruction Register Usage

    IA-32 execution modes: protected mode, real address mode, and Virtual 8086 mode. SSE, SSE2, and MMX technologies are architectural extensions in the IA-32 Intel architecture. Existing software will continue to run correctly, without modification on IA-32 microprocessors that incorporate these technologies. Existing software will also run correctly in the presence of applications that incorporate SIMD technologies.
  • Page 33: Summary Of Simd Technologies

    For more on SSE, SSE2, SSE3 and MMX technologies, see: IA-32 Intel® Architecture Software Developer’s Manual, Volume 1: Chapter 9, “Programming with Intel® MMX™ Technology”; Chapter 10, “Programming with Streaming SIMD Extensions (SSE)”;...
  • Page 34: Streaming Simd Extensions 2

    SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding. Streaming SIMD Extensions 2 Streaming SIMD extensions 2 add the following: • 128-bit data type with two packed double-precision floating-point operands • 128-bit data types for SIMD integer operation on 16-byte, 8-word, 4-doubleword, or 2-quadword integers •...
  • Page 35: Intel ® Extended Memory 64 Technology (Intel ® Em64T)

    (Intel® EM64T) Intel EM64T is an extension of the IA-32 Intel architecture. Intel EM64T increases the linear address space for software to 64 bits and supports physical address space up to 40 bits. The technology also introduces a new operating mode referred to as IA-32e mode.
  • Page 36: Intel Netburst ® Microarchitecture

    Intel NetBurst The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, Pentium processor Extreme Edition and the Intel Xeon processor implement the Intel NetBurst microarchitecture. This section describes the features of the Intel NetBurst microarchitecture and its operation common to the above processors.
  • Page 37: Overview Of The Intel Netburst Microarchitecture Pipeline

    • to operate at high clock rates and to scale to higher performance and clock rates in the future Design advances of the Intel NetBurst microarchitecture include: • a deeply pipelined design that allows for high clock rates (with different parts of the chip running at different clock rates).
  • Page 38: Figure 1-3 The Intel Netburst Microarchitecture

    Figure 1-3 illustrates a diagram of the major functional blocks associated with the Intel NetBurst microarchitecture pipeline. The following subsections provide an overview for each. Figure 1-3 The Intel NetBurst Microarchitecture...
  • Page 39: The Front End

    The Front End The front end of the Intel NetBurst microarchitecture consists of two parts: • fetch/decode unit • execution trace cache It performs the following functions: • prefetches IA-32 instructions that are likely to be executed • fetches required instructions that have not been prefetched •...
  • Page 40: The Out-Of-Order Core

    IA-32 Intel® Architecture Optimization The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch targets are fetched from the execution trace cache if they are cached, otherwise they are fetched from the memory hierarchy.
  • Page 41: Front End Pipeline Detail

    (BTB). This updates branch history. Figure 1-3 illustrates the paths that are most frequently executing inside the Intel NetBurst microarchitecture: an execution loop that interacts with multilevel cache hierarchy and the system bus.
  • Page 42: Decoder

    Decoder The front end of the Intel NetBurst microarchitecture has a single decoder that decodes instructions at the maximum rate of one instruction per clock. Some complex instructions must enlist the help of the microcode ROM.
  • Page 43: Branch Prediction

    It enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction. For Pentium 4 and Intel Xeon processors, the branch delay for a correctly predicted instruction can be as few as zero clock cycles.
  • Page 44: Execution Core Detail

    To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also: “Branch Prediction” in Chapter 2). Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome even before the branch instruction is decoded.
  • Page 45: Instruction Latency And Throughput

    Appendix C, “IA-32 Instruction Latency and Throughput,” lists some of the more-commonly-used IA-32 instructions with their latency, their issue throughput, and associated execution units (where relevant). Some IA-32 Intel® Architecture Processor Family Overview 1-17...
  • Page 46: Execution Units And Issue Ports

    IA-32 Intel® Architecture Optimization execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle). The number of µops associated with each instruction provides a basis for selecting instructions to generate. All µops executed out of the microcode ROM involve extra overhead.
  • Page 47: Caches

    MMX_MISC handles SIMD reciprocal and some integer operations Caches The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least two levels of on-chip cache are implemented in processors based on the Intel NetBurst microarchitecture. The Intel Xeon processor MP and selected Pentium and Intel Xeon processors may also contain a third-level cache.
  • Page 48: Table

    Each read due to a cache miss fetches a sector, consisting of two adjacent cache lines; a write operation is 64 bytes. Pentium 4 and Intel Xeon processors with CPUID model encoding value of 2 have a second level cache of 512 KB.
  • Page 49: Data Prefetch

    This approach has the following effect: • minimizes disturbance of temporal data in other cache levels IA-32 Intel® Architecture Processor Family Overview 1-21...
  • Page 50 • avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache levels Situations that are less likely to benefit from software prefetch are: • for cases that are already bandwidth bound, prefetching tends to increase bandwidth demands •...
  • Page 51 (stride that is greater than the trigger threshold distance), this can achieve additional benefit of improved temporal locality and reducing cache misses in the last level cache significantly. IA-32 Intel® Architecture Processor Family Overview 1-23...
  • Page 52: Loads And Stores

    Thus, software optimization of a data access pattern should emphasize tuning for hardware prefetch first to favor greater proportions of smaller-stride data accesses in the workload; before attempting to provide hints to the processor by employing software prefetch instructions. Loads and Stores The Pentium 4 processor employs the following techniques to speed up the execution of memory operations: •...
  • Page 53: Store Forwarding

    • Alignment: the store cannot wrap around a cache line boundary, and the linear address of the load must be the same as that of the store IA-32 Intel® Architecture Processor Family Overview 1-25...
  • Page 54: Intel ® Pentium ® M Processor Microarchitecture

    Intel® Pentium® M Processor Microarchitecture: Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M processor microarchitecture contains three sections: • in-order issue front end • out-of-order superscalar execution core • in-order retirement unit Intel Pentium M processor microarchitecture supports a high-speed system bus (up to 533 MHz) with 64-byte line size.
  • Page 55: The Front End

    Pentium M processor is shown in Figure 1-5. Figure 1-5 The Intel Pentium M Processor Microarchitecture The Front End The Intel Pentium M processor uses a pipeline depth that enables high performance and low power consumption. It’s shorter than that of the Intel NetBurst microarchitecture.
  • Page 56 The branch prediction hardware includes dynamic prediction, and branch target buffers. The Intel Pentium M processor has enhanced dynamic branch prediction hardware. Branch target buffers (BTB) predict the direction and target of branches based on an instruction’s address.
  • Page 57: Data Prefetching

    MMX technology loads and for most kinds of successive execution operations. Note that SSE loads cannot be fused. Data Prefetching The Intel Pentium M processor supports three prefetching mechanisms: • The first mechanism is a hardware instruction fetcher and is described in the previous section.
  • Page 58: Out-Of-Order Core

    Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries. See Table 1-3 for processor cache parameters. Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo and Intel® Core™ Duo Processors Level...
  • Page 59: In-Order Retirement

    Duo processor to minimize bus traffic between two cores accessing a single-copy of cached data. It allows an Intel Core Solo processor (or when one of the two cores in an Intel Core Duo processor is idle) to access its full capacity.
  • Page 60: Front End

    Pentium M processor (see Table 1-2). Front End Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors is improved over Pentium M processors by the following enhancements: •
  • Page 61: Data Prefetching

    Intel® Hyper-Threading (HT) Technology is supported by specific members of the Intel Pentium 4 and Xeon processor families. The technology enables software to take advantage of task-level, or thread-level parallelism by providing multiple logical processors within a physical processor package. In its first implementation in the Intel Xeon processor, Hyper-Threading Technology makes a single physical processor appear as two logical processors.
  • Page 62 The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.
  • Page 63: Figure 1-6 Hyper-Threading Technology On An Smp

    (Figure 1-6 diagram: each logical processor maintains its own architectural state while sharing the physical processor's execution resources.)
  • Page 64: Processor Resources And Hyper-Threading Technology

    (MTRRs) and the performance monitoring resources. For a complete list of the architecture state and exceptions, see the IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A & 3B. Other resources such as instruction pointers and register renaming tables were replicated to simultaneously track execution and state changes of the two logical processors.
  • Page 65: Shared Resources

    Shared mode: The L1 data cache is fully shared by two logical processors. • Adaptive mode: In adaptive mode, memory accesses using the page directory is mapped identically across logical processors sharing the L1 data cache. The other resources are fully shared. IA-32 Intel® Architecture Processor Family Overview 1-37...
  • Page 66: Microarchitecture Pipeline And Hyper-Threading Technology

    Microarchitecture Pipeline and Hyper-Threading Technology This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline. Although instructions originating from two programs or two threads execute simultaneously and not necessarily in program order in the execution core and memory hierarchy, the front end and back end contain several selection points to select between instructions from the...
  • Page 67: Execution Core

    The Intel Pentium D processor provides two logical processors in a physical package; each logical processor has a separate execution core and a cache hierarchy. The Dual-core Intel Xeon processor and the Intel...
  • Page 68 Each core provides two logical processors sharing an execution core and a cache hierarchy. The Intel Core Duo processor provides two logical processors in a physical package. Each logical processor has a separate execution core (including first-level cache) and a smart second-level cache.
  • Page 69: Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition And Intel Core Duo Processor

    Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor (diagram labels: Architectural State, Execution Engine, Local APIC, Caches, Bus Interface, System Bus for each listed processor).
  • Page 70: Microarchitecture Pipeline And Multi-Core Processors

    The Intel Core Duo processor has two symmetric cores that share the second-level cache and a single bus interface (see Figure 1-7). Two threads executing on two cores in an Intel Core Duo processor can take advantage of shared second-level cache, accessing a single-copy of cached data without generating bus traffic.
  • Page 71: Characteristics Of Load And Store Operations In Intel Core Duo Processors

    Table 1-5 lists the performance characteristics of generic load and store operations in an Intel Core Duo processor (data locality levels include the second-level cache, the first-level cache of the other core, and memory). Numeric values in Table 1-5 are in terms of processor core cycles. Table 1-5
  • Page 72 when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses that require the eviction of modified lines and are within a short time, there is an overall degradation in response time of these cache misses.
  • Page 73: Chapter 2 General Optimization Guidelines

    Intel compilers. The Intel compilers for the IA-32 processor family provide most of the optimizations. For those not using the Intel C++ or Fortran Compiler, the assembly code tuning optimizations may be useful. The explanations are supported by coding examples.
  • Page 74: Tuning To Prevent Known Coding Pitfalls

    IA-32 processors. Tuning to Prevent Known Coding Pitfalls To produce program code that takes advantage of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture, you must avoid the coding pitfalls that limit the performance of the target processor family.
  • Page 75: General Practices And Coding Guidelines

    “Tuning to Achieve Optimum Performance” section. It also highlights practices that use performance tools. The majority of these guidelines benefit processors based on the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. Some guidelines benefit one microarchitecture more than the other.
  • Page 76: Use Available Performance Tools

    — Set this compiler to produce code for the target processor implementation — Use the compiler switches for optimization and/or profile-guided optimization. These features are summarized in the “Intel® C++ Compiler” section. For more detail, see the Intel® C++ Compiler User’s Guide. • Current-generation performance monitoring tools, such as VTune™...
  • Page 77: Optimize Branch Predictability

    Optimize Branch Predictability • Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumption: backward taken and forward not taken. • Avoid mixing near calls, far calls and returns. • Avoid implementing a call by pushing the return address and jumping to the target.
  • Page 78: Optimize Floating-Point Performance

    • Minimize use of global variables and pointers. • Use the const variables. • Use new cacheability instructions and memory-ordering behavior. Optimize Floating-point Performance • Avoid exceeding representable ranges during computation, since handling these cases can have a performance impact. Do not use a larger precision format (double-extended floating point) unless required, since this increases memory size and bandwidth utilization.
  • Page 79: Optimize Instruction Scheduling

    • Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e.g., use shifts instead of multiplies). • Use the address calculation. • Some types of stores use more µops than others, try to use simpler store variants and/or reduce the number of stores.
  • Page 80: Coding Rules, Suggestions And Tuning Hints

    • Avoid the use of conditionals. • Keep induction (loop) variable expressions simple. • Avoid using pointers, try to replace pointers with arrays and indices. Coding Rules, Suggestions and Tuning Hints This chapter includes rules, suggestions and hints. They are maintained in separately-numbered lists and are targeted for engineers who are: •...
  • Page 81: Performance Tools

    Refer to the “Intel C++ Intrinsics Reference” section of the Intel® C++ Compiler User’s Guide. • C++ class libraries. Refer to the “Intel C++ Class Libraries for SIMD Operations Reference” section of the Intel® C++ Compiler User’s Guide. •...
  • Page 82: General Compiler Recommendations

    However, if particular performance problems are noted with the compiled code, some compilers (like the Intel C++ and Fortran Compilers) allow the coder to insert intrinsics or inline assembly in order to exert greater control over what code is generated.
  • Page 83: Processor Perspectives

    Processor Perspectives The majority of the coding recommendations for the Pentium 4 and Intel Xeon processors also apply to Pentium M, Intel Core Solo, and Intel Core Duo processors. However, there are situations where a recommendation may benefit one microarchitecture more than the other.
  • Page 84 CPUID signature family 6, model 9). On Pentium 4, Intel Xeon processors, Pentium M processor (with CPUID signature family 6, model 13), and Intel Core Solo, and Intel Core Duo processors, such penalties are resolved by artificial dependencies between each partial register write.
  • Page 85: Cpuid Dispatch Strategy And Compatible Code Strategy

    • On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cache. On Pentium M processors, code size limit is governed by the instruction cache. • There may be a penalty when instructions with immediates requiring more than 16-bit signed representation are placed next to other instructions that use immediates.
  • Page 86: Transparent Cache-Parameter Strategy

    IA-32 processor families. See CPUID instruction in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2B. For coding techniques that rely on specific parameters of a cache level,...
  • Page 87: Branch Prediction

    Branch Prediction Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving the predictability of branches, you can increase the speed of code significantly. Optimizations that help branch prediction are: • Keep code and data on separate pages (a very important item, see more details in the “Memory Accesses”...
  • Page 88 Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches. For the Pentium M processor, every branch counts; even correctly predicted branches have a negative effect on the amount of useful code delivered to the processor.
  • Page 89: Example 2-1 Assembly Code With An Unpredictable Branch

    Example 2-1 Assembly Code with an Unpredictable Branch: cmp A, B / jge L30 / mov ebx, CONST1 / jmp L31 / L30: mov ebx, CONST2 / L31: ... Example 2-2 Code Optimization to Eliminate Branches: xor ebx, ebx / cmp A, B / setge bl / sub ebx, 1 / and ebx, CONST3 / add ebx, CONST2. See Example 2-2. The optimized code sets ebx to zero, then compares A and B.
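    As a hedged C-level illustration of the same branch-elimination idea (assuming the compiler lowers it to setcc/cmov-style code; CONST1 and CONST2 follow the example, the function name is hypothetical):

        /* Returns CONST1 when A < B, otherwise CONST2, without a conditional branch. */
        int select_without_branch(int A, int B, int CONST1, int CONST2)
        {
            int mask = -(A < B);                        /* all ones if A < B, else zero */
            return (mask & (CONST1 - CONST2)) + CONST2;
        }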
  • Page 90: Spin-Wait And Idle Loops

    Pentium processors and earlier 32-bit Intel architecture processors. Be sure to check whether a processor supports these instructions with the CPUID instruction. Spin-Wait and Idle Loops: The Pentium 4 processor introduces a new...
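    A minimal C sketch of a spin-wait loop using the pause hint discussed in this section via the _mm_pause intrinsic; the flag name and loop structure are assumptions, not the manual's Example 2-4.

        #include <xmmintrin.h>   /* _mm_pause */

        void spin_until_set(volatile int *flag)
        {
            while (*flag == 0)
                _mm_pause();     /* hint to the processor that this is a spin-wait loop */
        }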
  • Page 91: Static Prediction

    Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted using a static prediction algorithm. The Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms: •...
  • Page 92: Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm

    Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.
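    A small hedged C example of arranging code to match the static prediction rules above (forward branches not taken, backward branches taken); the function names are hypothetical.

        extern int handle_rare_error(int value);   /* assumed, rarely executed path */

        int process(int value)
        {
            if (value < 0)                         /* forward branch, rarely taken */
                return handle_rare_error(value);
            return value * 2;                      /* common case is the fall-through path */
        }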
  • Page 93: Example 2-6 Static Taken Prediction Example

    Examples 2-6 and 2-7 provide basic rules for the static prediction algorithm. In Example 2-6, the backward branch is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction will not occur.
  • Page 94: Inlining, Calls And Returns

    Inlining, Calls and Returns The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may be degraded.
  • Page 95: Branch Type Selection

    General Optimization Guidelines Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache. Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession;...
  • Page 96 Placing data immediately following an indirect branch can cause a performance problem. If the data consist of all zeros, it looks like a long stream of adds to memory destinations, which can cause resource conflicts and slow down branch recovery. Also, the data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages.
  • Page 97: Example 2-8 Indirect Branch With Two Favored Targets

    indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history. The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of branches, even at the expense of adding more branches.
  • Page 98: Loop Unrolling

    best performance from a coding effort. An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 2-9. Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction function () int n = rand();...
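    A hedged C sketch of the peeling idea: test for the most frequent target before the indirect call so that the hot case becomes a predictable conditional branch. The table and handler names are hypothetical, not the manual's Example 2-9.

        typedef void (*handler_t)(void);
        extern handler_t handler_table[];          /* assumed dispatch table */
        extern void common_handler(void);          /* assumed most frequent target */

        void dispatch(int n)
        {
            if (handler_table[n] == common_handler)
                common_handler();                  /* direct, well-predicted call on the hot path */
            else
                handler_table[n]();                /* indirect call for the remaining targets */
        }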
  • Page 99 • The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known, unroll inner loops until they have a maximum of 16 iterations.
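    An illustrative C sketch of unrolling (assuming the trip count is a multiple of four and the arrays are hypothetical): each loop-closing branch now covers four iterations of work.

        void scale_by_two(float *a, const float *b, int n)   /* n assumed divisible by 4 */
        {
            int i;
            for (i = 0; i < n; i += 4) {
                a[i]     = b[i]     * 2.0f;
                a[i + 1] = b[i + 1] * 2.0f;
                a[i + 2] = b[i + 2] * 2.0f;
                a[i + 3] = b[i + 3] * 2.0f;
            }
        }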
  • Page 100: Compiler Support For Branch Prediction

    Compiler Support for Branch Prediction Compilers can generate code that improves the efficiency of branch prediction in the Pentium 4 and Pentium M processors. The Intel C++ Compiler accomplishes this by: • keeping code and data on separate pages •...
  • Page 101: Memory Accesses

    Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4, Intel Xeon, and Pentium M processors. On the Pentium 4 processor, an access to data unaligned on 64-byte boundary leads to two memory accesses and requires several µops to be...
  • Page 102 Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries. For best performance, align data as follows: •...
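    A hedged sketch of obtaining the 16-byte alignment this rule calls for; _mm_malloc and _mm_free are compiler-provided helpers (Intel and GNU compilers), and the function name is hypothetical. Statically allocated data can instead use __declspec(align(16)) or __attribute__((aligned(16))).

        #include <stddef.h>
        #include <xmmintrin.h>   /* declares _mm_malloc / _mm_free on these compilers */

        float *alloc_aligned_floats(size_t n)
        {
            return (float *)_mm_malloc(n * sizeof(float), 16);   /* release with _mm_free() */
        }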
  • Page 103: Example 2-11 Code That Causes Cache Line Split

    Example 2-11 Code That Causes Cache Line Split: mov esi, 029e70feh / mov edi, 05be5260h / Blockmove: mov eax, DWORD PTR [esi] / mov ebx, DWORD PTR [esi+4] / mov DWORD PTR [edi], eax / mov DWORD PTR [edi+4], ebx / add esi, 8 / add edi, 8 / sub edx, 1 / jnz Blockmove. Figure 2-1 Cache Line Split in Accessing Elements in an Array (diagram shows address 029e70c1h and the cache lines at 029e70c0h and 029e7100h).
  • Page 104: Store Forwarding

    Store Forwarding The processor’s memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency. There are two kinds of requirements for store forwarding.
  • Page 105: Store-To-Load-Forwarding Restriction On Size And Alignment

    Pentium M processors than that for Pentium 4 processors. This section describes these restrictions in all cases. It prescribes recommendations to prevent the non-forwarding penalty. Fixing this problem for Pentium 4 and Intel Xeon processors also fixes problem on Pentium M processors. 2-33...
  • Page 106: Figure 2-2 Size And Alignment Restrictions In Store Forwarding

    The size and alignment restrictions for store forwarding are illustrated in Figure 2-2. Figure 2-2 Size and Alignment Restrictions in Store Forwarding (a) Small load after Large Store (b) Size of Load >= (c) Size of Load >= Store(s) (d) 128-bit Forward Must Be 16-Byte Aligned
  • Page 107: Example 2-12 Several Situations Of Small Loads After Large Store

    A load that forwards from a store must wait for the store’s data to be written to the store buffer before proceeding, but other, unrelated loads need not wait. Assembly/Compiler Coding Rule 20. (H impact, ML generality) If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary.
  • Page 108: Example 2-14 A Non-Forwarding Situation In Compiler Generated Code

    Example 2-13 A Non-forwarding Example of Large Load After Small Store mov [EBP], ‘a’ mov [EBP + 1], ‘b’ mov [EBP + 2], ‘c’ mov [EBP + 3], ‘d’ mov EAX, [EBP] ; The first 4 small stores can be consolidated into ;...
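    A hedged C analogue of the situation above: assembling the value in a register and performing one full-width store lets a following 32-bit load forward from that store. Names are illustrative.

        void pack_bytes(unsigned char a, unsigned char b,
                        unsigned char c, unsigned char d, unsigned int *out)
        {
            unsigned int v = (unsigned int)a
                           | ((unsigned int)b << 8)
                           | ((unsigned int)c << 16)
                           | ((unsigned int)d << 24);
            *out = v;   /* one 32-bit store instead of four byte stores */
        }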
  • Page 109: Example 2-16 Large And Small Load Stalls

    When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are more efficient (if aligned) and can be used to avoid unaligned loads. Although floating-point registers allow the movement of 64 bits at a time, floating point instructions should not be used for this purpose, as data may be inadvertently modified.
  • Page 110: Store-Forwarding Restriction On Data Availability

    However, the overall impact of this problem is much smaller than that from size and alignment requirement violations. The Pentium 4 and Intel Xeon processors predict when loads are both dependent on and get their data forwarded from preceding stores. These predictions can significantly improve performance.
  • Page 111: Data Layout Optimizations

    An example of a loop-carried dependence chain is shown in Example 2-17. Example 2-17 An Example of Loop-carried Dependence Chain for (i=0; i<MAX; i++) { a[i] = b[i] * foo; foo = a[i]/3; Data Layout Optimizations User/Source Coding Rule 2. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.
  • Page 112: Example 2-19 Decomposing An Array

    Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multimedia). These reference and use data only once before discarding it. Data accesses which sparsely utilize the data within a cache line can result in less efficient utilization of system memory bandwidth.
  • Page 113 However, if the access pattern of the array exhibits locality, such as if the array index is being swept through, then the Pentium 4 processor prefetches data from struct_of_array, even if the elements of the structure are accessed together. When the elements of the structure are not accessed with equal frequency, such as when one element is accessed far more often than the other entries, then struct_of_array...
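    A brief C sketch contrasting the two layouts discussed here; the structure and field names are hypothetical. When only one field is swept frequently, the structure-of-arrays form keeps that field densely packed in cache lines.

        #define N 1024

        struct vertex_aos { float x, y, z, w; };
        struct vertex_aos scene_aos[N];     /* array of structures: fields interleaved per element */

        struct vertex_soa {
            float x[N];
            float y[N];                     /* sweeping y touches contiguous memory */
            float z[N];
            float w[N];
        };
        struct vertex_soa scene_soa;        /* structure of arrays */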
  • Page 114: Stack Alignment

    User/Source Coding Rule 3. (M impact, L generality) Beware of false sharing within a cache line (64 bytes) for Pentium 4, Intel Xeon, and Pentium M processors; and within a sector of 128 bytes on Pentium 4 and Intel Xeon processors.
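    A hedged C sketch of avoiding the false sharing described in this rule: pad per-thread data so that each item occupies its own 128-byte sector (the padding size and names are assumptions).

        #define SECTOR_SIZE 128   /* Pentium 4 / Intel Xeon second-level sector */

        struct padded_counter {
            volatile long value;
            char pad[SECTOR_SIZE - sizeof(long)];   /* keep neighboring counters in other sectors */
        };

        struct padded_counter per_thread_count[4];  /* one element per thread */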
  • Page 115: Capacity Limits And Aliasing In Caches

    Note that first-level cache lines are 64 bytes. Thus the least significant 6 bits are not considered in alias comparisons. For the Pentium 4 and Intel Xeon processors, data is loaded into the second level cache in a sector of 128 bytes, so the least significant 7 bits are not considered in alias comparisons.
  • Page 116: Capacity Limits In Set-Associative Caches

    On Pentium 4 and Intel Xeon processors with CPUID signature of family encoding 15, model...
  • Page 117: Aliasing Cases In The Pentium ® 4 And Intel ® Xeon ® Processors

    Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors Aliasing conditions that are specific to the Pentium 4 processor and Intel Xeon processor are: • 16K for code – there can only be one of these in the trace cache at a time. If two traces whose starting addresses are 16K apart are in the same working set, the symptom will be a high trace cache miss rate.
  • Page 118: Aliasing Cases In The Pentium M Processor

    Aliasing Cases in the Pentium M Processor Pentium M, Intel Core Solo and Intel Core Duo processors have the following aliasing case: • Store forwarding - If there has been a store to an address followed by a load to the same address within a short time window, the load will not proceed until the store data is available.
  • Page 119: Mixing Code And Data

    1 KB subpages. Self-modifying Code Self-modifying code (SMC) that ran correctly on Pentium III processors and prior implementations will run correctly on subsequent implementations, including Pentium 4 and Intel Xeon processors. SMC...
  • Page 120: Write Combining

    Saving traffic is particularly important for avoiding partial writes to uncached memory. There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with CPUID signature of family encoding 15, model encoding 3, there are 8 write-combining buffers). Two of these buffers...
  • Page 121 General Optimization Guidelines write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC. Assembly/Compiler Coding Rule 28. (H impact, L generality) If an inner loop writes to more than four arrays, (four distinct cache lines), apply loop fission to break up the body of the loop such that only four arrays are being written to in each iteration of each of the resulting loops.
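    A hedged C illustration of the loop-fission rule quoted above: a loop that writes six arrays is split so that each resulting loop writes at most four distinct streams. Array names are hypothetical.

        void init_arrays(float *a, float *b, float *c,
                         float *d, float *e, float *f, int n)
        {
            int i;
            for (i = 0; i < n; i++) {          /* first loop: four write streams */
                a[i] = 0.0f; b[i] = 0.0f; c[i] = 0.0f; d[i] = 0.0f;
            }
            for (i = 0; i < n; i++) {          /* second loop: remaining write streams */
                e[i] = 0.0f; f[i] = 0.0f;
            }
        }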
  • Page 122: Locality Enhancement

    RFO since the line is not cached, and there is no such delay. For details on write-combining, see the Intel Architecture Software Developer’s Manual. Locality Enhancement Locality enhancement can reduce data traffic originating from an outer-level sub-system in the cache/memory hierarchy; this addresses the fact that the access-cost in terms of cycle-count from an outer level will be more expensive than from an inner level.
  • Page 123 Locality enhancement to the last level cache can be accomplished with sequencing the data access pattern to take advantage of hardware prefetching. This can also take several forms: • Transformation of a sparsely populated multi-dimensional array into a one-dimension array such that memory references occur in a sequential, small-stride prefetch.
  • Page 124: Minimizing Bus Latency

    Minimizing Bus Latency The system bus on Intel Xeon and Pentium 4 processors provides up to 6.4 GB/sec bandwidth of throughput at 200 MHz scalable bus clock rate. (See MSR_EBC_FREQUENCY_ID register.) The peak bus bandwidth is even higher with higher bus clock rates.
  • Page 125: Non-Temporal Store Bus Traffic

    User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should pay attention to favor data access patterns that result in higher concentrations of cache miss patterns with cache miss strides that are significantly smaller than half of the hardware prefetch trigger threshold.
  • Page 126: Example 2-21 Non-Temporal Stores And 64-Byte Bus Write Transactions

    Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions #define STRIDESIZE 256 Lea ecx, p64byte_Aligned Mov edx, ARRAY_LEN Xor eax, eax slloop: movntps XMMWORD ptr [ecx + eax], xmm0 movntps XMMWORD ptr [ecx + eax+16], xmm0 movntps XMMWORD ptr [ecx + eax+32], xmm0 movntps XMMWORD ptr [ecx + eax+48], xmm0 ;...
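    An intrinsic-level sketch of the same idea as Example 2-21 (a hedged rendering, not the manual's code): movntps via _mm_stream_ps writes full lines without a read-for-ownership. The destination is assumed 16-byte aligned and the count a multiple of four.

        #include <xmmintrin.h>

        void stream_fill(float *dst, __m128 value, int n_floats)
        {
            int i;
            for (i = 0; i < n_floats; i += 4)
                _mm_stream_ps(&dst[i], value);   /* non-temporal 16-byte store */
            _mm_sfence();                        /* order the streaming stores */
        }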
  • Page 127: Prefetching

    64-bytes into the first-level data cache without polluting the second-level cache. Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data relative to those on the Pentium M processors. The key differences are summarized in Table 1-2.
  • Page 128: Cacheability Instructions

    access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consideration than using software prefetch instructions. The hardware prefetcher is best for small-stride data access patterns in either direction with cache-miss stride not far from 64 bytes. This is true for data accesses to addresses that are either known or unknown at the time of issuing the load operations.
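    Where the hardware prefetcher does not cover an access pattern, a software prefetch can still be issued from C; a minimal hedged sketch follows (the prefetch distance of 16 elements and the names are assumptions).

        #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

        float sum_with_prefetch(const float *data, int n)
        {
            float s = 0.0f;
            int i;
            for (i = 0; i < n; i++) {
                if (i + 16 < n)
                    _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);  /* request data ahead */
                s += data[i];
            }
            return s;
        }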
  • Page 129: Code Alignment

    Because the trace cache (TC) removes the decoding stage from the pipeline for frequently executed code, optimizing code alignment for decoding is not as important for Pentium 4 and Intel Xeon processors. For the Pentium M processor, code alignment and the alignment of branch target will affect the throughput of the decoder.
  • Page 130: Guidelines For Optimizing Floating-Point Code

    Guidelines for Optimizing Floating-point Code User/Source Coding Rule 10. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches. Follow this procedure to investigate the performance of your floating-point application: • Understand how the compiler handles floating-point code. •...
  • Page 131 to early out). However, be careful of introducing more than a total of two values for the floating point control word, or there will be a large performance penalty. See “Floating-point Modes”. User/Source Coding Rule 13. (H impact, ML generality) Use fast float-to-int routines, FISTTP, or SSE2 instructions.
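    A minimal sketch of User/Source Coding Rule 13 using the SSE2 truncating conversion (assuming SSE2 is available; the function name is hypothetical), which avoids changing the x87 rounding mode:

        #include <emmintrin.h>   /* SSE2 intrinsics */

        int fast_float_to_int(double d)
        {
            return _mm_cvttsd_si32(_mm_set_sd(d));   /* truncate toward zero */
        }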
  • Page 132: Floating-Point Modes And Exceptions

    • arithmetic underflow • denormalized operand Refer to Chapter 4 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 1 for the definition of overflow, underflow and denormal exceptions. Denormalized floating-point numbers impact performance in two ways: •...
  • Page 133 executing SSE/SSE2/SSE3 instructions and when speed is more important than complying to IEEE standard. The following paragraphs give recommendations on how to optimize your code to reduce performance degradations related to floating-point exceptions. Dealing with floating-point exceptions in x87 FPU code Every special situation listed in the “Floating-point Exceptions”...
  • Page 134: Floating-Point Modes

    Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification. If a programmer is willing to trade pure IEEE 754 compliance for speed, two non-IEEE 754 compliant modes are provided to speed up situations where underflows and denormalized inputs are frequent: FTZ mode and DAZ mode.
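    A hedged sketch of enabling the two modes named above through the MXCSR control macros (only appropriate when strict IEEE 754 handling of denormals can be traded for speed):

        #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
        #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (SSE3 header) */

        void enable_ftz_daz(void)
        {
            _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* FTZ: flush tiny results to zero */
            _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* DAZ: treat denormal inputs as zero */
        }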
  • Page 135 FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel Core Duo processors, FLDCW is improved over previous generations. Specifically, the optimization for FLDCW allows software to alternate between two constant values efficiently. For the...
  • Page 136 Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating point control word. Changes for more than two values (each value being a combination of the following bits: precision, rounding and infinity control, and the rest of bits in FCW) leads to delays that are on the order of the pipeline depth.
  • Page 137 If there is more than one change to rounding, precision and infinity bits and the rounding mode is not important to the result, use the algorithm in Example 2-23 to avoid synchronization issues, the overhead of the fldcw instruction and having to change the rounding mode.
  • Page 138: Example 2-23 Algorithm To Avoid Changing The Rounding Mode

    Example 2-23 Algorithm to Avoid Changing the Rounding Mode _fto132proc ecx,[esp-8] esp,16 ecx,-8 st(0) fistp qword ptr[ecx] fild qword ptr[ecx] edx,[ecx+4]; high dword of integer eax,[ecx] test eax,eax integer_QnaN_or_zero arg_is_not_integer_QnaN: fsubp st(1),st test edx,edx positive fstp dword ptr[ecx]; result of subtraction ecx,[ecx] esp,16 ecx,80000000h...
  • Page 139 Example 2-23 Algorithm to Avoid Changing the Rounding Mode (continued) positive: fstp dword ptr[ecx] ; 17-18 result of subtraction ecx,[ecx] esp,16 ecx,7fffffffh eax,0 integer_QnaN_or_zero: test edx,7fffffffh arg_is_not_integer_QnaN add esp,16 Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
  • Page 140: Improving Parallelism And The Use Of Fxch

    Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode. Improving Parallelism and the Use of FXCH The x87 instruction set relies on the floating point stack for one of its operands. If the dependence graph is a tree, which means each intermediate result is used only once and code is scheduled carefully, it is often possible to use only operands that are on the top of the stack or in memory, and to avoid using operands that are buried under the top of...
  • Page 141: X87 Vs. Scalar Simd Floating-Point Trade-Offs

    This in turn allows instructions to be reordered to make instructions available to be executed in parallel. Out-of-order execution precludes the need for using x87 vs. Scalar SIMD Floating-point Trade-offs There are a number of differences between x87 floating-point code and scalar floating-point code (using SSE and SSE2).
  • Page 142: Scalar Sse/Sse2 Performance On Intel Core Solo And Intel Core Duo Processors

    Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors On Intel Core Solo and Intel Core Duo processors, the combination of improved decoding and micro-op fusion allows instructions which were formerly two, three, and four micro-ops to go through all decoders. As a result, scalar SSE/SSE2 code can match the performance of x87 code executing through two floating-point units.
  • Page 143: Memory Operands

    On Pentium M, Intel Core Solo and Intel Core Duo processors; this penalty can be avoided by using movlpd. However, using movlpd causes performance penalty on Pentium 4 processors.
  • Page 144: Floating-Point Stalls

    Floating-Point Stalls Floating-point instructions have a latency of at least two cycles. But, because of the out-of-order nature of Pentium II and the subsequent processors, stalls will not necessarily occur on an instruction or µop basis. However, if an instruction has a very long latency such as an fdiv, then scheduling can improve the throughput of the overall application.
  • Page 145: Instruction Selection

    Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or Streaming SIMD Extensions 2. Instruction Selection This section explains how to generate optimal assembly code. The listed optimizations have been shown to contribute to the overall performance at the application level on the order of 5%.
  • Page 146: Complex Instructions

    Complex Instructions Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead. Complex instructions may save architectural registers, but incur a penalty of 4 µops to set up parameters for the microcode ROM.
  • Page 147: Use Of The Inc And Dec Instructions

    Use of the inc and dec Instructions: The inc and dec instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.
  • Page 148: Integer Divide

    Operand Sizes and Partial Register Accesses The Pentium 4 processor, Pentium M processor (with CPUID signature family 6, model 13), Intel Core Solo and Intel Core Duo processors do not incur a penalty for partial register accesses; Pentium M processor...
  • Page 149: Example 2-24 Dependencies Caused By Referencing Partial Registers

    (model 9) does incur a penalty. This is because every operation on a partial register updates the whole register. However, this does mean that there may be false dependencies between any references to partial registers. Example 2-24 demonstrates a series of false and real dependencies caused by referencing partial registers.
  • Page 150: Table 2-3 Avoiding Partial Register Stall When Packing Byte Values

    Table 2-3 illustrates packing three byte values into a register. Table 2-3 Avoiding Partial Register Stall When Packing Byte Values A Sequence with Partial Register Stall: mov al,byte ptr a[2] / shl eax,16 / mov ax,word ptr a / movd mm0,eax. Assembly/Compiler Coding Rule 44. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.
  • Page 151 less delay than the partial register update problem mentioned above, but the performance gain may vary. If the additional µop is a critical problem, movsx can sometimes be used as an alternative. Sometimes sign-extended semantics can be maintained by zero-extending operands. For example, the C code in the following statements does not need sign extension, nor does it need prefixes for operand size overrides: static short int a, b;...
  • Page 152: Prefixes And Instruction Decoding

    FF). Use of an LCP causes a change in the number of bytes to encode the displacement operand in the instruction. On Pentium M, Intel Core Solo and Intel Core Duo processors; the following situation causes extra delays when decoding an instruction with an LCP: •...
  • Page 153: Rep Prefix And Data Movement

    • Processing an instruction with the 0x66 prefix that (i) has a modr/m byte in its encoding and (ii) the opcode byte of the instruction happens to be aligned on byte 14 of an instruction fetch line. The performance delay in this case is approximately twice of those other two situations.
  • Page 154 String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doublewords plus single byte moves with a count value less or equal to 3. Because software can use SIMD data movement instructions to move 16 bytes at a time, the following paragraphs discuss general guidelines for designing and implementing high-performance library functions such as...
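    A hedged C sketch of the decomposition described here: move the bulk of the buffer at doubleword granularity and finish with at most three single-byte moves (assumes suitably aligned buffers; names are illustrative, this is not the library routine).

        #include <stddef.h>

        void copy_small(unsigned char *dst, const unsigned char *src, size_t n)
        {
            size_t dwords = n / 4, i;
            unsigned int *d4 = (unsigned int *)dst;
            const unsigned int *s4 = (const unsigned int *)src;

            for (i = 0; i < dwords; i++)      /* doubleword-granularity moves */
                d4[i] = s4[i];
            for (i = dwords * 4; i < n; i++)  /* remaining 0..3 single-byte moves */
                dst[i] = src[i];
        }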
  • Page 155 For cases N < a small count, where the small count threshold will vary between microarchitectures (empirically, 8 may be a good value when optimizing for Intel NetBurst microarchitecture). Each case can be coded directly without the overhead of a looping structure.
  • Page 156 improve address alignment, a small piece of prolog code using movsb/stosb with count less than 4 can be used to peel off the non-aligned data moves before starting to use movsd/stosd. • For cases where N is less than half the size of last level cache, throughput consideration may favor either: (a) an approach using REP string with the largest data granularity because REP string has little overhead for loop iteration, and the branch misprediction...
  • Page 157: Table 2-5 Using Rep Stosd With Arbitrary Count Size And 4-Byte-Aligned Destination

    for (i=0;i<size;i++) *d++ = (char)c; Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignments, counter values, and microarchitectures. In most cases, applications should take advantage of the default memory routines provided by Intel Compilers.
  • Page 158: Address Calculations

    In some situations, the byte count of the data to operate on is known from the context (versus from a parameter passed from a call). One can take a simpler approach than that required for a general-purpose library routine. For example, if the byte count is also small, using rep movsb/stosb with a count less than four can ensure good address alignment and loop unrolling to finish the remaining data;
  • Page 159: Clearing Registers

    The xorps and xorpd instructions cannot be used to break dependence chains. In Intel Core Solo and Intel Core Duo processors, the pxor, xorps and xorpd instructions can be used to clear execution dependencies on the zero evaluation of the destination register.
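    As a hedged, intrinsic-level illustration of the zero idiom (not the manual's example), clearing a register with an exclusive-OR of the register with itself, rather than loading zero from memory, lets the accumulator start without a dependence on its previous contents:

    #include <xmmintrin.h>

    /* Sketch: start an accumulation from a zero idiom; _mm_setzero_ps()
       typically compiles to xorps xmm, xmm. p is assumed 16-byte aligned. */
    static __m128 sum4(const float *p, int n4)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n4; i++)
            acc = _mm_add_ps(acc, _mm_load_ps(p + 4 * i));
        return acc;
    }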
  • Page 160: Floating Point/Simd Operands

    Often a produced value must be compared with zero, and then used in a branch. Because most Intel architecture instructions set the condition codes as part of their execution, the compare instruction may be eliminated.
  • Page 161 as an alternative; it writes all 128 bits. Even though this movapd instruction has a longer latency, the μops for execution port and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the dependence or the execution port.
  • Page 162: Prolog Sequences

    Prolog Sequences Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame pointer and that do not have called routines that modify ESP, use ESP as the base register to free up EBP. This optimization does not apply in the following cases: a routine is called that leaves ESP modified upon return, for example, structured or C++ style exception handling;...
  • Page 163: Instruction Scheduling

    Example 2-25 Recombining LOAD/OP Code into REG,MEM Form
    LOAD reg1, mem1
    ... code that does not write to reg1 ...
    OP reg2, reg1
    ... code that does not use reg1 ...
    Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache packing more difficult. On the Pentium 4 processor, the sequence of loading a value from memory into a register and adding the results in a register to memory is faster than the alternate sequence of adding a value from memory to a...
  • Page 164: Spill Scheduling

    Scheduling Rules for the Pentium 4 Processor Decoder The Pentium 4 and Intel Xeon processors have a single decoder that can decode instructions at the maximum rate of one instruction per clock.
  • Page 165: Scheduling Rules For The Pentium M Processor Decoder

    Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required. Scheduling Rules for the Pentium M Processor Decoder The Pentium M processor has three decoders, but the decoding rules to supply micro-ops at high bandwidth are less stringent than those of the Pentium III processor.
  • Page 166 Extensions 2. Thus the vector length ranges from 2 to 16, depending on the instruction extensions used and on the data type. The Intel C++ Compiler supports vectorization in three ways: • The compiler may be able to generate SIMD code without intervention from the user.
  • Page 167: Miscellaneous

    User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches. User/Source Coding Rule 20. (M impact, ML generality) Keep induction (loop) variable expressions simple. Miscellaneous This section explains separate guidelines that do not belong to any category described above.
  • Page 168: Summary Of Rules And Suggestions

    The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware. Therefore, a code generator should arrange to use the register containing the oldest value as input, so that the NOP will dispatch and release RS resources at the earliest possible opportunity.
  • Page 169: User/Source Coding Rules

    User/Source Coding Rule 3. (M impact, L generality) Beware of false sharing within a cache line (64 bytes) on Pentium 4, Intel Xeon, and Pentium M processors, and within a sector of 128 bytes on Pentium 4 and Intel Xeon processors (a padding sketch follows). 2-42 User/Source Coding Rule 4.
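    A minimal sketch of avoiding false sharing (the structure, thread count and padding size are assumptions for illustration; 64 bytes covers a cache line, 128 bytes covers a sector on Pentium 4 and Intel Xeon processors):

    /* Sketch: give each thread's counter its own 128-byte sector so that
       updates from different threads never share a cache line or sector. */
    typedef struct {
        volatile long counter;
        char pad[128 - sizeof(long)];   /* pad the element to the sector size */
    } PaddedCounter;

    PaddedCounter per_thread_counter[4]; /* one element per thread */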
  • Page 170 User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should.pay attention to favor data access patterns that result in higher concentrations of cache miss patterns with cache miss strides that are significantly smaller than half of the hardware prefetch trigger threshold.
  • Page 171: Assembly/Compiler Coding Rules

    look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision, the size of the look-up table, and taking advantage of the parallelism of the Streaming SIMD Extensions and the Streaming SIMD Extensions 2 instructions.
  • Page 172 order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch. 2-16 Assembly/Compiler Coding Rule 3.
  • Page 173 Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks. 2-22 Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk. 2-22 Assembly/Compiler Coding Rule 12.
  • Page 174 Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data. 2-34 Assembly/Compiler Coding Rule 19. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.
  • Page 175 first-level cache working set. Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the same second-level cache working set. Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KB.
  • Page 176 Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
  • Page 177 Assembly/Compiler Coding Rule 42. (M impact, H generality) instructions should be replaced with an because overwrite all flags, whereas inc and dec do not, therefore creating false dependencies on earlier instructions that set the flags. 2-73 Assembly/Compiler Coding Rule 43. (ML impact, L generality) Avoid by register or rotate rotate...
  • Page 178 instead of a zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register. 2-79 Assembly/Compiler Coding Rule 51. (ML impact, M generality) Eliminate unnecessary compare with zero instructions by using the appropriate conditional jump instruction when the flags are already set by a preceding arithmetic instruction.
  • Page 179 Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their source operand in memory and the destination operand is in a register, attempt a strategy that initially loads the memory operand to a register followed by a register to register ALU operation.
  • Page 180: Tuning Suggestions

    Tuning Suggestions Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on a code page as instructions. The only condition where this is likely to happen is following an indirect branch that is not resident in the trace cache. If a performance problem is clearly due to this problem, try moving the data elsewhere, or inserting an illegal opcode or a pause instruction immediately following the indirect branch.
  • Page 181: Chapter 3 Coding For Simd Architectures

    Coding for SIMD Architectures Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions technology (SSE), and MMX technology. In addition, Streaming SIMD Extensions 3 (SSE3) were introduced with the Pentium 4 processor supporting Hyper-Threading Technology at 90 nm technology.
  • Page 182: Checking For Processor Support Of Simd Technologies

    Checking for Processor Support of SIMD Technologies This section shows how to check whether a processor supports MMX technology, SSE, SSE2, or SSE3. SIMD technology can be included in your application in three ways: Check for the SIMD technology during installation. If the desired SIMD technology is available, the appropriate DLLs can be installed.
  • Page 183: Checking For Streaming Simd Extensions Support

    Example 3-2 shows how to find the SSE feature bit (bit 25) in the feature flags.
    ; identify signature is genuine intel
    ; request for feature flags
    ; 0Fh, 0A2h cpuid instruction
    ; is MMX technology bit (bit ...
    ;...
  • Page 184: Example 3-3 Identification Of Sse By The Os

    __asm xorps xmm0, xmm0 ; Streaming SIMD Extension
    _except(EXCEPTION_EXECUTE_HANDLER) {
      if (_exception_code()==STATUS_ILLEGAL_INSTRUCTION)
        /* SSE not supported by OS */
        return (false);
    }
    /* SSE are supported by OS */
    return (true);
    ; identify signature is genuine intel
    ; request for feature flags
    ; 0Fh, 0A2h
    ; bit 25 in feature flags equal to 1
    Found
  • Page 185: Checking For Streaming Simd Extensions 2 Support

    See Example 3-5.
    ; check for SSE2 technology existence using the cpuid instruction
    ; identify signature is genuine intel
    ; request for feature flags
    ; 0Fh, 0A2h cpuid instruction
    ; bit 26 in feature flags equal to 1...
  • Page 186: Checking For Streaming Simd Extensions 3 Support

    Example 3-5 Identification of SSE2 by the OS
    bool OSSupportCheck() {
      _try {
        __asm xorpd xmm0, xmm0 ; SSE2
      }
      _except(EXCEPTION_EXECUTE_HANDLER) {
        if (_exception_code()==STATUS_ILLEGAL_INSTRUCTION)
          ...
      }
      /* SSE2 are supported by OS */
      ...
    }
    Checking for Streaming SIMD Extensions 3 Support
    SSE3 includes 13 instructions, 11 of which are suited for SIMD or x87 style programming.
  • Page 187: Example 3-6 Identification Of Sse3 With Cpuid

    MONITOR and MWAIT detection can be done by executing the MONITOR instruction and trapping for an exception, similar to the sequence shown in Example 3-7.
    ; identify signature is genuine intel
    ; request for feature flags
    ; 0Fh, 0A2h cpuid instruction...
  • Page 188: Considerations For Code Conversion To Simd Programming

    Example 3-7 Identification of SSE3 by the OS bool SSE3_SIMD_SupportCheck() { _try { __asm addsubpd xmm0, xmm0 ; SSE3} _except(EXCEPTION_EXECUTE_HANDLER) { if _exception_code()==STATUS_ILLEGAL_INSTRUCTION) /* SSE3 SIMD and FISTTP instructions are supported */ Considerations for Code Conversion to SIMD Programming The VTune Performance Enhancement Environment CD provides tools to aid in the evaluation and tuning.
  • Page 189: Figure 3-1 Converting To Streaming Simd Extensions Chart

    Figure 3-1 Converting to Streaming SIMD Extensions Chart (flowchart: identify hot spots in the code, determine whether the code benefits from SIMD, choose SIMD integer or floating-point, and for floating-point decide whether the required range or precision allows conversion to integer or to single precision).
  • Page 190: Identifying Hot Spots

    To use any of the SIMD technologies optimally, you must evaluate the following situations in your code: • fragments that are computationally intensive • fragments that are executed often enough to have an impact on performance • fragments with little data-dependent control flow •...
  • Page 191: Determine If Code Benefits By Conversion To Simd Execution

    Because the Intel VTune analyzer is designed specifically for all of the Intel architecture (IA)-based processors, including the Pentium 4 processor, it can offer detailed approaches to working with IA. See “Code Optimization Options”...
  • Page 192: Coding Techniques

    XMM registers). • Re-code the loop with the SIMD instructions. Each of these actions is discussed in detail in the subsequent sections of this chapter. These sections also discuss enabling automatic vectorization via the Intel C++ Compiler. 3-12...
  • Page 193: Coding Methodologies

    Coding Methodologies Software developers need to compare the performance improvement that can be obtained from assembly code against the cost of those improvements. Programming directly in assembly language for a target platform may produce the required performance gain; however, assembly code is not portable between processor architectures and is expensive to write and maintain.
  • Page 194: Example 3-8 Simple Four-Iteration Loop

    The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and integer data under SSE2, SSE, and MMX technology. As a basis for the usage model discussed in this section, consider a simple loop shown in Example 3-8.
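    Example 3-8 itself is not reproduced in this extract; a hedged sketch of such a simple four-iteration loop (names assumed) is:

    /* Sketch: a candidate loop for SIMD conversion; adds four floats element-wise. */
    void add4(const float *a, const float *b, float *c)
    {
        int i;
        for (i = 0; i < 4; i++)
            c[i] = a[i] + b[i];
    }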
  • Page 195: Assembly

    XMMWORD PTR [ecx], xmm0
    Intrinsics
    Intrinsics provide access to the ISA functionality using C/C++ style coding instead of assembly language. Intel has defined three sets of intrinsic functions that are implemented in the Intel C++ Compiler to support the MMX technology, Streaming SIMD Extensions and Streaming SIMD Extensions 2.
  • Page 196: Example 3-10 Simple Four-Iteration Loop Coded With Intrinsics

    The intrinsics map one-to-one with actual Streaming SIMD Extensions assembly code. The xmmintrin.h header file, in which the intrinsics are defined, is part of the Intel C++ Compiler included with the VTune Performance Enhancement Environment CD. Intrinsics are also defined for the MMX technology ISA. These are...
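    As a hedged sketch of what the intrinsics form of such a loop looks like (this is illustrative, not the manual's Example 3-10; the arrays are assumed to be 16-byte aligned):

    #include <xmmintrin.h>

    /* Sketch: the four-element add expressed with SSE intrinsics,
       mapping to movaps/addps/movaps. */
    void add4_intrin(const float *a, const float *b, float *c)
    {
        __m128 va = _mm_load_ps(a);     /* requires 16-byte alignment */
        __m128 vb = _mm_load_ps(b);
        _mm_store_ps(c, _mm_add_ps(va, vb));
    }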
  • Page 197: Classes

    C++ classes, the performance of applications using this methodology can approach that of one using the intrinsics. Further details on the use of these classes can be found in the Intel C++ Class Libraries for SIMD Operations User’s Guide, order number 693500.
  • Page 198: Automatic Vectorization

    Again, the example assumes that the arrays passed to the routine are already aligned to a 16-byte boundary. Automatic Vectorization The Intel C++ Compiler provides an optimization mechanism by which loops, such as the one in Example 3-8, can be automatically vectorized, or converted into Streaming SIMD Extensions code. The compiler uses similar techniques to those used by a programmer to identify whether a loop is suitable for conversion to SIMD.
  • Page 199: Example 3-12 Automatic Vectorization For A Simple Loop

    (See documentation for the Intel C++ Compiler.) The restrict keyword avoids the associated overhead altogether. Refer to the Intel® C++ Compiler User’s Guide for details on the switches of the Intel C++ Compiler; a sketch follows.
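    A hedged sketch of a loop the compiler can auto-vectorize once aliasing is ruled out with the restrict keyword (the function name is an assumption; consult the Intel C++ Compiler documentation for the exact switches):

    /* Sketch: restrict tells the compiler a, b and c do not overlap,
       removing the run-time overlap checks that otherwise inhibit
       vectorization of the loop. */
    void add_n(float * restrict c, const float * restrict a,
               const float * restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }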
  • Page 200: Stack And Data Alignment

    Stack and Data Alignment To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described in this section. Assembly code with unaligned accesses is a lot slower than code with aligned accesses. Alignment and Contiguity of Data Access Patterns The 64-bit packed data types defined by MMX technology, and the 128-bit packed data types for Streaming SIMD Extensions and...
  • Page 201: Using Arrays To Make Data Contiguous

    By adding the padding variable, the first element is aligned to 8 bytes (64 bits), and all following elements will also be aligned. The sample declaration follows:
    typedef struct {
      short x, y, z;
      char a;
      char pad;
    } Point;
    Point pt[N];
    Using Arrays to Make Data Contiguous
    In the following code, for (i=0;...
  • Page 202: Stack Alignment For 128-Bit Simd Technologies

    The IA-32 software conventions, as implemented in most compilers, do not provide any mechanism for ensuring that certain local data and certain parameters are 16-byte aligned. Therefore, Intel has defined a new set of IA-32 software conventions for alignment to support the new __m128...
  • Page 203: Data Alignment For Mmx Technology

    “holes” (due to padding) in the argument block. These new conventions, presented in this section as implemented by the Intel C++ Compiler, can be used as a guideline for assembly language code as well. In many cases, this section assumes the use of the __m128 data type, as defined by the Intel C++ Compiler, which represents an array of four 32-bit floats.
  • Page 204: Data Alignment For 128-Bit Data

    8-byte alignment. The following discussion and examples describe alignment techniques for Pentium 4 processor as implemented with the Intel C++ Compiler. Compiler-Supported Alignment The Intel C++ Compiler provides the following methods to ensure that the data is aligned. Alignment by F32vec4...
  • Page 205 __declspec(align(16)) declarations to force 16-byte alignment. This is particularly useful for local or global data declarations that are assigned to 128-bit data types. The syntax for it is __declspec(align(integer-constant)) where the integer-constant than 32. For example, the following increases the alignment to 16-bytes: __declspec(align(16)) float buffer[400];...
  • Page 206 128-bit data. The default behavior is to use to align routines with 8- or 16-byte data types to 16-bytes. For more details, see relevant Intel application notes in the Intel Architecture Performance Training Center provided with the SDK and the Intel® C++ Compiler User’s Guide.
  • Page 207: Improving Memory Utilization

    Improving Memory Utilization Memory performance can be improved by rearranging data and algorithms for SSE 2, SSE, and MMX technology intrinsics. The methods for improving memory performance involve working with the following: • Data structure layout • Strip-mining for vectorization and memory utilization •...
  • Page 208: Example 3-16 Aos And Soa Code Samples

    SoA Data Structure Example 3-15
    typedef struct {
      float x[NumOfVertices];
      float y[NumOfVertices];
      float z[NumOfVertices];
      int a[NumOfVertices];
      int b[NumOfVertices];
      int c[NumOfVertices];
      . . .
    } VerticesList;
    VerticesList Vertices;
    There are two options for computing data in AoS format: perform the operation on the data as it stands in AoS format, or re-arrange it (swizzle it) into SoA format dynamically.
  • Page 209 Example 3-16 AoS and SoA Code Samples (continued) addps xmm1, xmm0 movaps xmm2, xmm1 shufps xmm2, xmm2,55h addps xmm2, xmm1 ; SoA code ; X = x0,x1,x2,x3 ; Y = y0,y1,y2,y3 ; Z = z0,z1,z2,z3 ; A = xF,xF,xF,xF ; B = yF,yF,yF,yF ;...
  • Page 210 but is somewhat inefficient as there is the overhead of extra instructions during computation. Performing the swizzle statically, when the data structures are being laid out, is best as there is no runtime overhead. As mentioned earlier, the SoA arrangement allows more efficient use of the parallelism of the SIMD technologies because the data is ready for computation in a more optimal vertical manner: multiplying components...
  • Page 211 Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses arrays Example 3-15 would require three separate data streams. This can require the use of more prefetches, additional address generation calculations, as well as having a greater impact on DRAM page access efficiency.
  • Page 212: Strip Mining

    Strip Mining Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine.
  • Page 213: Example 3-19 Strip Mined Code

    Example 3-18 Pseudo-code Before Strip Mining (continued) for (i=0; i<Num; i++) { Lighting(v[i]); The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of array that were cached during v[i]...
  • Page 214: Loop Blocking

    In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen so that strip_size elements of the array fit in the cache hierarchy; this way, a given element will still be in the cache when we perform the second pass over it, which can improve performance over the non-strip-mined code. Loop Blocking Loop blocking is another useful technique for memory performance optimization.
  • Page 215: Example 3-20 Loop Blocking

    Example 3-20 Loop Blocking
    A. Original Loop
    float A[MAX, MAX], B[MAX, MAX]
    for (i=0; i< MAX; i++) {
      for (j=0; j< MAX; j++) {
        A[i,j] = A[i,j] + B[j, i];
      }
    }
    B. Transformed Loop after Blocking
    float A[MAX, MAX], B[MAX, MAX];
    for (i=0;...
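    The transformed loop is truncated above; a hedged C sketch of the blocked form (the block size, array dimensions and loop order are assumptions, not the manual's Example 3-20 text) follows:

    #define MAX 1024
    #define BLOCK 8          /* chosen so a block of B fits in cache */

    float A[MAX][MAX], B[MAX][MAX];

    /* Sketch: iterate over BLOCK-sized tiles of j so the lines of B touched
       by the inner loops are reused before being evicted. */
    void blocked_add(void)
    {
        for (int jj = 0; jj < MAX; jj += BLOCK)
            for (int i = 0; i < MAX; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    A[i][j] = A[i][j] + B[j][i];
    }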
  • Page 216: Figure 3-3 Loop Blocking Access Pattern

    This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, a factor. Suppose that array will be eight cache lines (32 bytes each). In the first iteration of the inner loop, A[0, 0:7] will be completely consumed by the first iteration of the B[0, 0:7]...
  • Page 217: Instruction Selection

    As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. Blocking can also help reduce the penalty from DTLB (data translation look-aside buffer) misses. In addition to improving the cache/memory performance, this optimization technique also saves external bus bandwidth.
  • Page 218: Simd Optimizations And Microarchitectures

    However, the consumers should not be scheduled near the producer. SIMD Optimizations and Microarchitectures Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst microarchitecture. The following sub-section discusses optimizing SIMD code targeting Intel Core Solo and Intel Core Duo processors.
  • Page 219: Tuning The Final Application

    Using the VTune analyzer can help you with various phases required for optimized performance. See “Intel® VTune™ Performance Analyzer” in Appendix A for more details on how to use the VTune analyzer. After every effort to optimize, you should check the performance gains to see where you are making your major optimization gains.
  • Page 220
  • Page 221: Chapter 4 Optimizing For Simd Integer Applications

    The SIMD integer instructions provide performance improvements in applications that are integer-intensive and can take advantage of the SIMD architecture of Pentium 4, Intel Xeon, and Pentium M processors. The guidelines for using these instructions in addition to the guidelines...
  • Page 222: General Rules On Simd Integer Code

    SIMD data in the XMM register is strongly discouraged. • Use the optimization rules and guidelines described in Chapter 2 and Chapter 3 that apply to the Pentium 4, Intel Xeon and Pentium M processors. • Take advantage of hardware prefetcher where possible. Use prefetch instruction only when data access patterns are irregular and prefetch distance can be pre-determined.
  • Page 223: Using Simd Integer With X87 Floating-Point

    Using SIMD Integer with x87 Floating-point All 64-bit SIMD integer instructions use the MMX registers, which share register state with the x87 floating-point stack. Because of this sharing, certain rules and considerations apply. Instructions which use the MMX registers cannot be freely intermixed with x87 floating-point instructions.
  • Page 224: Guidelines For Using Emms Instruction

    Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready for new x87 floating-point operations. The emms instruction should be used when switching between operations on the MMX registers and operations on the x87 floating-point stack. On the Pentium 4 processor, there is a finite overhead for using the emms instruction. Failure to use emms between operations on the MMX registers and operations on the x87...
  • Page 225: Example 4-1 Resetting The Register Between __M64 And Fp Data Types

    __m64 x = _m_paddd(y, z); float f = init(); Further, you must be aware that your code generates an MMX instruction, which uses the MMX registers with the Intel C++ Compiler, in the following situations: • when using a 64-bit SIMD integer intrinsic from MMX technology, SSE, or SSE2 •...
  • Page 226: Data Alignment

    Data Alignment Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can incur a performance penalty due to accesses that span 2 cache lines. Referencing unaligned 128-bit SIMD integer data will result in an exception unless the movdqu (move double-quadword unaligned) instruction is used.
  • Page 227: Signed Unpack

    Example 4-2 Unsigned Unpack Instructions
    ; Input: MM7 0
    ; Output:
    movq MM1, MM0
    punpcklwd MM0, MM7
    punpckhwd MM1, MM7
    Signed Unpack
    Signed numbers should be sign-extended when unpacking the values. This is similar to the zero-extend shown above except that the psrad instruction (packed shift right arithmetic) is used to effectively sign extend the values.
  • Page 228: Interleaved Pack With Saturation

    Example 4-3 Signed Unpack Code
    ; Input:
    ; Output:
    movq MM1, MM0
    punpcklwd MM0, MM0
    punpckhwd MM1, MM1
    psrad MM0, 16   ; source
    psrad MM1, 16
    Interleaved Pack with Saturation
    The pack instructions pack two values into the destination register in a predetermined order.
  • Page 229: Figure 4-2 Interleaved Pack With Saturation

    Figure 4-1 PACKSSDW mm, mm/mm64 Instruction Example. Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that uses the operation. The two signed doublewords are used as source operands and the result is interleaved signed words.
  • Page 230: Interleaved Pack Without Saturation

    16-bit values of the two sources into eight saturated eight-bit unsigned values in the destination. A complete specification of the MMX instruction set can be found in the Intel Architecture MMX Technology Programmer’s Reference Manual, order number 243007.
  • Page 231: Non-Interleaved Unpack

    Example 4-5 Interleaved Pack without Saturation
    ; Input:
    ; Output:
    pslld MM1, 16
    pand MM0, {0,ffff,0,ffff}
    por MM0, MM1
    Non-Interleaved Unpack
    The unpack instructions perform an interleave merge of the data elements of the destination and source operands into the destination register.
  • Page 232: Figure 4-4 Result Of Non-Interleaved Unpack High In Mm1

    Figure 4-3 Result of Non-Interleaved Unpack Low in MM0 The other destination register will contain the opposite combination illustrated in Figure 4-4. Figure 4-4 Result of Non-Interleaved Unpack High in MM1 Code in the Example 4-6 unpacks two packed-word sources in a non-interleaved way.
  • Page 233: Extract Word

    Example 4-6 Unpacking Two Packed-word Sources in a Non-interleaved Way
    ; Input:
    ; Output:
    movq MM2, MM0
    punpckldq MM0, MM1
    punpckhdq MM2, MM1
    Extract Word
    The pextrw instruction takes the word in the designated MMX register selected by the two least significant bits of the immediate value and moves it to the lower half of a 32-bit integer register; see Figure 4-5 and Example 4-7.
  • Page 234: Insert Word

    Figure 4-5 pextrw Instruction
    Example 4-7 pextrw Instruction Code
    ; Input:
    ; Output:
    movq mm0, [eax]
    pextrw edx, mm0, 0
    Insert Word
    The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory and inserts it in the MMX technology destination register at a position defined by the two least significant bits of the immediate constant.
  • Page 235: Example 4-8 Pinsrw Instruction Code

    Figure 4-6 pinsrw Instruction
    Example 4-8 pinsrw Instruction Code
    ; Input:
    ; Output:
    mov eax, [edx]
    pinsrw mm0, eax, 1
    If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be useful to clear the content and break the dependence chain by either using the pxor instruction or loading the register.
  • Page 236: Move Byte Mask To Integer

    Example 4-9 Repeated pinsrw Instruction Code
    ; Input:
    ; Output:
    pxor mm0, mm0
    mov eax, [edx]
    pinsrw mm0, eax, 0
    mov eax, [edx+10]
    pinsrw mm0, eax, 1
    mov eax, [edx+13]
    pinsrw mm0, eax, 2
    mov eax, [edx+24]
    pinsrw mm0, eax, 3
    Move Byte Mask to Integer
    The pmovmskb instruction returns a bit mask formed from the most significant bits of each byte of its source operand.
  • Page 237: Example 4-10 Pmovmskb Instruction Code

    Figure 4-7 pmovmskb Instruction Example
    Example 4-10 pmovmskb Instruction Code
    ; Input: source value
    ; Output: 32-bit register containing the byte mask in the lower eight bits
    movq mm0, [edi]
    pmovmskb eax, mm0
  • Page 238: Packed Shuffle Word For 64-Bit Registers

    Packed Shuffle Word for 64-bit Registers The pshufw instruction (see Figure 4-8, Example 4-11) uses the immediate (imm8) operand to select between the four words in either two MMX registers or one MMX register and a 64-bit memory location. Bits 1 and 0 of the immediate value encode the source for destination word 0 in the MMX register (Bits 1 - 0...
  • Page 239: Packed Shuffle Word For 128-Bit Registers

    Example 4-11 pshuf Instruction Code
    ; Input:
    ; Output:
    movq mm0, [edi]
    pshufw mm1, mm0, 0x1b
    Packed Shuffle Word for 128-bit Registers
    The pshuflw/pshufhw instructions perform a shuffle of any source word field within the low/high 64 bits to any result word field in the low/high 64 bits, using an 8-bit immediate operand; the other high/low 64 bits are passed through from the source operand.
  • Page 240: Unpacking/Interleaving 64-Bit Data In 128-Bit Registers

    Example 4-13 Swap Using 3 Instructions /* Goal: Swap the values in word 6 and word 1 */ /* Instruction Result */ PSHUFD (3,0,1,2)| 7| 6| 1| 0| 3| 2| 5| 4| PSHUFHW (3,1,2,0)| 7| 1| 6| 0| 3| 2| 5| 4| PSHUFD (3,0,1,2)| 7| 1| 5| 4| 3| 2| 6| 0| Example 4-14 Reverse Using 3 Instructions /* Goal:...
  • Page 241: Data Movement

    Data Movement There are two additional instructions to enable data movement from the 64-bit SIMD integer registers to the 128-bit SIMD registers. The movq2dq instruction moves the 64-bit integer data from an MMX register (source) to a 128-bit destination register. The high-order 64 bits of the destination register are zeroed-out.
  • Page 242 Example 4-15 Generating Constants (continued) pxor MM0, MM0 pcmpeq MM1, MM1 psubb MM0, MM1 [psubw ; three instructions above generate ; the constant 1 in every ; packed-byte [or packed-word] ; (or packed-dword) field pcmpeq MM1, MM1 psrlw MM1, 16-n(psrld ;...
  • Page 243: Building Blocks

    Building Blocks This section describes instructions and algorithms which implement common code building blocks efficiently. Absolute Difference of Unsigned Numbers Example 4-16 computes the absolute difference of two unsigned numbers. It assumes an unsigned packed-byte data type. Here, we make use of the subtract instruction with unsigned saturation.
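    A hedged intrinsics sketch of the same idea (illustrative only, not the manual's Example 4-16, which is written with MMX instructions): subtracting in both directions with unsigned saturation and OR-ing the results yields |x - y| per byte.

    #include <emmintrin.h>

    /* Sketch: absolute difference of packed unsigned bytes. */
    static __m128i absdiff_u8(__m128i x, __m128i y)
    {
        __m128i d1 = _mm_subs_epu8(x, y);   /* x-y where x>y, else 0 */
        __m128i d2 = _mm_subs_epu8(y, x);   /* y-x where y>x, else 0 */
        return _mm_or_si128(d1, d2);        /* one of the two is zero */
    }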
  • Page 244: Absolute Difference Of Signed Numbers

    Absolute Difference of Signed Numbers Example 4-17 computes the absolute difference of two signed numbers. The technique used here is to first sort the corresponding elements of the input operands into packed words of the maximum values, and packed words of the minimum values. Then the minimum values are subtracted from the maximum values to generate the required absolute difference.
  • Page 245: Absolute Value

    Example 4-17 Absolute Difference of Signed Numbers (continued)
    movq MM2, MM0
    pcmpgtw MM0, MM1
    movq MM4, MM2
    pxor MM2, MM1
    pand MM2, MM0
    pxor MM4, MM2
    pxor MM1, MM2
    psubw MM1, MM4
    Absolute Value
    Use Example 4-18 to compute the absolute value; the example assumes signed words to be the operands.
  • Page 246: Clipping To An Arbitrary Range [High, Low]

    Clipping to an Arbitrary Range [high, low] This section explains how to clip a value to a range [high, low]. Specifically, if the value is less than low or greater than high, it is clipped to low or high, respectively. The technique uses packed-add and packed-subtract instructions with saturation (signed or unsigned), which means that this technique can only be used on packed-byte and packed-word data types.
  • Page 247: Highly Efficient Clipping

    Highly Efficient Clipping For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions may be used. For clipping unsigned bytes to an arbitrary range, the pmaxub and pminub instructions may be used. Example 4-19 shows how to clip signed words to an arbitrary range; the code for clipping unsigned bytes is similar (an intrinsics sketch follows). Example 4-19 Clipping to a Signed Range of Words [high, low] ;...
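    A hedged intrinsics sketch of clipping packed signed words with the min/max instructions (illustrative only; the manual's Example 4-19 is written in assembly):

    #include <emmintrin.h>

    /* Sketch: clamp each signed 16-bit element of x into [low, high]. */
    static __m128i clip_i16(__m128i x, short low, short high)
    {
        __m128i lo = _mm_set1_epi16(low);
        __m128i hi = _mm_set1_epi16(high);
        x = _mm_min_epi16(x, hi);           /* pminsw: cap at high */
        x = _mm_max_epi16(x, lo);           /* pmaxsw: raise to low */
        return x;
    }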
  • Page 248: Clipping To An Arbitrary Unsigned Range [High, Low]

    The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last instruction converts the data back to signed data and places the data within the signed range. Conversion to unsigned data is required for correct results when ( 0x8000 If ( high...
  • Page 249: Packed Max/Min Of Signed Word And Unsigned Byte

    packed-subtract instructions with unsigned saturation, thus this technique can only be used on packed-byte and packed-word data types. The example illustrates the operation on word values.
  • Page 250: Unsigned Byte

    Unsigned Byte The pmaxub instruction returns the maximum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory location. The pminub instruction returns the minimum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory location.
  • Page 251: Packed Average (Byte/Word)

    Figure 4-9 PSADBW Instruction Example The subtraction operation presented above is an absolute difference, that is, t = abs(x-y). The absolute difference values are summed together, and the result is written into the lower word of the destination register. Packed Average (Byte/Word) The pavgb and pavgw instructions add the unsigned data elements of the source operand to the unsigned data elements of the destination register, along with a carry-in.
  • Page 252: Complex Multiply By A Constant

    The pavgb instruction operates on packed unsigned bytes and the pavgw instruction operates on packed unsigned words. Complex Multiply by a Constant Complex multiplication is an operation which requires four multiplications and two additions. This is exactly how the pmaddwd instruction operates. In order to use this instruction, you need to format the data into multiple 16-bit values.
  • Page 253: Packed 32*32 Multiply

    Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input). Packed 32*32 Multiply The PMULUDQ instruction performs an unsigned multiply on the lower pair of double-word operands within each 64-bit chunk from the two sources;...
  • Page 254: Memory Optimizations

    Memory Optimizations You can improve memory accesses using the following techniques: • Avoiding partial memory accesses • Increasing the bandwidth of memory fills and video fills • Prefetching data with Streaming SIMD Extensions (see Chapter 6, “Optimizing Cache Usage”). The MMX registers and XMM registers allow you to move large quantities of data without stalling the processor.
  • Page 255: Partial Memory Accesses

    Partial Memory Accesses Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address mem). The large load will stall in this case as shown in Example 4-24.
    Example 4-24 A Large Load after a Series of Small Stores (Penalty)
    mov mem, eax
    mov mem + 4, ebx
    movq...
  • Page 256: Example 4-26 A Series Of Small Loads After A Large Store

    Let us now consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address mem), as shown in Example 4-26. Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 for more details.
  • Page 257: Supplemental Techniques For Avoiding Cache Line Splits

    Optimizing for SIMD Integer Applications These transformations, in general, increase the number of instructions required to perform the desired operation. For Pentium II, Pentium III, and Pentium 4 processors, the benefit of avoiding forwarding problems outweighs the performance penalty due to the increased number of instructions, making the transformations worthwhile.
  • Page 258: Example 4-29 Video Processing Using Lddqu To Avoid Cache Line Splits

    SSE3 provides the LDDQU instruction for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested.
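    A hedged intrinsics sketch of using lddqu for unaligned 128-bit loads (the function name is an assumption; SSE3 is required):

    #include <pmmintrin.h>

    /* Sketch: load 16 bytes from a possibly unaligned address with lddqu,
       which avoids the cache-line-split penalty of a conventional unaligned load. */
    static __m128i load_unaligned(const void *p)
    {
        return _mm_lddqu_si128((const __m128i *)p);
    }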
  • Page 259: Increasing Bandwidth Of Memory Fills And Video Fills

    (video fills). These recommendations are relevant for all Intel architecture processors with MMX technology and refer to cases in which the loads and stores do not hit in the first- or second-level cache.
  • Page 260: Increasing Uc And Wc Store Bandwidth By Using Aligned Stores

    same DRAM page have shorter latencies than sequential accesses to different DRAM pages. In many systems the latency for a page miss (that is, an access to a different page instead of the page previously accessed) can be twice as large as the latency of a memory page hit (access to the same page as the previous access).
  • Page 261: Simd Optimizations And Microarchitectures

    — code sequence is rewritten to use instructions that shift the double quad-word operand by bytes. SIMD Optimizations and Microarchitectures Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst microarchitecture. The following sections discuss optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.
  • Page 262: Packed Sse2 Integer Versus Mmx Instructions

    The net effect of using 128-bit SIMD integer instructions on Intel Core Solo and Intel Core Duo processors is likely to be slightly positive overall, but there may be a few situations where they will generate an unfavorable performance impact.
  • Page 263: Chapter 5 Optimizing For Simd Floating-Point Applications

    Optimizing for SIMD Floating-point Applications This chapter discusses general rules of optimizing for the single-instruction, multiple-data (SIMD) floating-point instructions available in Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and Streaming SIMD Extensions 3 (SSE3). This chapter also provides examples that illustrate the optimization techniques for single-precision and double-precision SIMD floating-point applications.
  • Page 264: Planning Considerations

    • Use MMX technology instructions and registers for copying data that is not used later in SIMD floating-point computations. • Use the reciprocal instructions followed by an iteration for increased accuracy (a sketch follows). These instructions yield reduced accuracy but execute much faster. Note the following: —...
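    A hedged sketch of the reciprocal-plus-iteration idea (one Newton-Raphson step is a standard refinement; this is illustrative, not the manual's code):

    #include <xmmintrin.h>

    /* Sketch: approximate 1/x with rcpps, then refine with one Newton-Raphson
       step, y' = y * (2 - x*y), recovering most of the lost precision. */
    static __m128 recip_nr(__m128 x)
    {
        __m128 y   = _mm_rcp_ps(x);                 /* ~12-bit approximation */
        __m128 two = _mm_set1_ps(2.0f);
        return _mm_mul_ps(y, _mm_sub_ps(two, _mm_mul_ps(x, y)));
    }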
  • Page 265: Using Simd Floating-Point With X87 Floating-Point

    SIMD floating-point code uses a flat register model, whereas x87 floating-point code uses a stack model. Using scalar floating-point code eliminates the need to use fxch instructions, which have a performance limit on the Intel Pentium 4 processor. • SIMD floating-point code can be mixed with MMX technology code without penalty.
  • Page 266: Data Alignment

    When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form. However, all of the optimizations regarding alignment, scheduling, instruction selection, and other optimizations covered in Chapter 2 and Chapter 3 should be observed.
  • Page 267: Vertical Versus Horizontal Computation

    For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array of structures (AoS). To fully utilize the SIMD registers in such applications, a new data layout has been proposed—a structure of arrays (SoA) resulting in more optimized performance.
  • Page 268 simultaneously referred to as an diagram below) are computed in parallel, and the array is updated one vertex at a time. When data structures are organized for the horizontal computation model, sometimes the availability of homogeneous arithmetic operations in SSE and SSE2 may cause inefficiency or require additional intermediate movement between data elements.
  • Page 269: Table 5-1 Soa Form Of Representing Vertices Data

    To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on each component of 4 separate vertices, that is, processing multiple vectors simultaneously. This can also be referred to as an SoA form of representing vertices data shown in Table 5-1. Table 5-1 SoA Form of Representing Vertices Data Vx array...
  • Page 270: Example 5-1 Pseudocode For Horizontal (Xyz, Aos) Computation

    Figure 5-2 Dot Product Operation Figure 5-2 shows how 1 result would be computed for 7 instructions if the data were organized as AoS and using SSE alone: 4 results would require 28 instructions. Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation mulps movaps shufps...
  • Page 271: Data Swizzling

    Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how 4 results are computed for 5 instructions. Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation mulps ; x*x' for all 4 x-components of 4 vertices mulps ;...
  • Page 272: Example 5-3 Swizzling Data

    To gather data from 4 different memory locations on the fly, follow these steps: identify the first half of the 128-bit memory location; group the different halves together to form an xyxy layout; from the 4 attached halves, get the xxxx by using one shuffle; the yyyy is derived the same way but only requires one shuffle.
  • Page 273 Example 5-3 Swizzling Data (continued) y1 x1 movhps xmm7, [ecx+16] movlps xmm0, [ecx+32] movhps xmm0, [ecx+48] movaps xmm6, xmm7 shufps xmm7, xmm0, 0x88 shufps xmm6, xmm0, 0xDD movlps xmm2, [ecx+8] movhps xmm2, [ecx+24] movlps xmm1, [ecx+40] movhps xmm1, [ecx+56] movaps xmm0, xmm2 shufps xmm2, xmm1, 0x88 movlps xmm7, [ecx] movaps [edx], xmm7...
  • Page 274: Example 5-4 Swizzling Data Using Intrinsics

    Example 5-4 shows the same data -swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for SSE. Example 5-4 Swizzling Data Using Intrinsics //Intrinsics version of data swizzle void swizzle_intrin (Vertex_aos *in, Vertex_soa *out, int stride) __m128 x, y, z, w;...
  • Page 275 CAUTION. previous computations because the instructions bypass one part of the register. The same issue can occur with the use of an exclusive-OR function within an inner loop in order to clear a register: xorps xmm0, xmm0 Although the generated result of all zeros does not depend on the specific data contained in the source operand (that is, with itself always produces all zeros), the instruction cannot execute until the instruction that generates...
  • Page 276: Data Deswizzling

    Data Deswizzling In the deswizzle operation, we want to arrange the SoA format back into AoS format so the data can be stored to memory in its original layout; unpack instructions regenerate the interleaved form, and each half is stored into its corresponding memory location using movlps followed by movhps. Example 5-5 illustrates the deswizzle function:
  • Page 277: Example 5-6 Deswizzling Data Using The Movlhps And Shuffle Instructions

    Example 5-5 Deswizzling Single-Precision SIMD Data (continued)
    unpcklps xmm5, xmm4
    unpckhps xmm0, xmm4
    movlps [edx+8], xmm5
    movhps [edx+24], xmm5
    movlps [edx+40], xmm0
    movhps [edx+56], xmm0
    // DESWIZZLING ENDS HERE
    You may have to swizzle data in the registers, but not in memory. This occurs when two different functions need to process the data in different layouts.
  • Page 278: Example 5-7 Deswizzling Data 64-Bit Integer Simd Data

    Example 5-6 Deswizzling Data Using the movlhps and shuffle Instructions (continued)
    // Start deswizzling here
    movaps xmm7, xmm4
    movhlps xmm7, xmm3
    movaps xmm6, xmm2
    movlhps xmm3, xmm4
    movhlps xmm2, xmm1
    movlhps xmm1, xmm6
    movaps xmm6, xmm2
    movaps xmm5, xmm1
    shufps xmm2, xmm7, 0xDD   // xmm2= r4 g4 b4 a4
    shufps xmm1, xmm3, 0x88   // xmm4= r1 g1 b1 a1
    shufps xmm5, xmm3, 0x88   // xmm5= r2 g2 b2 a2
    shufps xmm6, xmm7, 0xDD   // xmm6= r3 g3 b3 a3...
  • Page 279: Using Mmx Technology Code For Copy Or Shuffling Functions

    Example 5-7 Deswizzling Data 64-bit Integer SIMD Data (continued)
    movq mm1, [ebx+16]
    movq mm2, mm0
    punpckhdq mm0, mm1
    punpckldq mm2, mm1
    movq [edx], mm2
    movq [edx+8], mm0
    movq mm4, [ebx+8]
    movq mm5, [ebx+24]
    movq mm6, mm4
    punpckhdq mm4, mm5
    punpckldq mm6, mm5
    movq [edx+16], mm6
    movq [edx+24], mm4
    Using MMX Technology Code for Copy or Shuffling Functions...
  • Page 280: Horizontal Add Using Sse

    Example 5-8 illustrates how to use MMX technology code for copying or shuffling. Example 5-8 Using MMX Technology Code for Copying or Shuffling (only the instruction mnemonics survive in this extract: movq, punpckhdq, punpckldq). Horizontal ADD Using SSE Although vertical computations generally make better use of SIMD performance than horizontal computations do, in some cases the code must use a horizontal operation.
  • Page 281: Figure 5-3 Horizontal Add Using Movhlps/Movlhps

    Figure 5-3 Horizontal Add Using movhlps/movlhps (the figure shows movlhps, addps and shufps combining A1..A4, B1..B4, C1..C4 into sums such as A1+A2+A3+A4)
    Example 5-9 Horizontal Add Using movhlps/movlhps
    void horiz_add(Vertex_soa *in, float *out) {
      __asm {
        mov ecx, in
        mov edx, out
        movaps xmm0, [ecx]
        movaps xmm1, [ecx+16]
        movaps...
  • Page 282 Example 5-9 Horizontal Add Using movhlps/movlhps (continued) // START HORIZONTAL ADD movaps xmm5, xmm0 movlhps xmm5, xmm1 movhlps xmm1, xmm0 addps xmm5, xmm1 movaps xmm4, xmm2 movlhps xmm2, xmm3 movhlps xmm3, xmm4 addps xmm3, xmm2 movaps xmm6, xmm3 shufps xmm3, xmm5, 0xDD shufps xmm5, xmm6, 0x88 addps xmm6, xmm5...
  • Page 283: Use Of Cvttps2Pi/Cvttss2Si Instructions

    Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps void horiz_add_intrin(Vertex_soa *in, float *out) __m128 v1, v2, v3, v4; __m128 tmm0,tmm1,tmm2,tmm3,tmm4,tmm5,tmm6; tmm0 = _mm_load_ps(in->x); tmm1 = _mm_load_ps(in->y); tmm2 = _mm_load_ps(in->z); tmm3 = _mm_load_ps(in->w); tmm5 = tmm0; tmm5 = _mm_movelh_ps(tmm5, tmm1); tmm1 = _mm_movehl_ps(tmm1, tmm0); tmm5 = _mm_add_ps(tmm5, tmm1);...
  • Page 284: Flush-To-Zero And Denormals-Are-Zero Modes

    avoided since there is a penalty associated with writing this register; typically, by using the cvttps2pi and cvttss2si instructions, changing the rounding control in MXCSR can be avoided. Flush-to-Zero and Denormals-are-Zero Modes The flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes are not compatible with the IEEE Standard 754. They are provided to improve performance for applications where underflow is common and where the generation of a denormalized result is not necessary.
  • Page 285: Sse3 And Complex Arithmetics

    Figure 5-4 Asymmetric Arithmetic Operation of the SSE3 Instruction. Figure 5-5 Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD. SSE3 and Complex Arithmetics The flexibility of SSE3 in dealing with AoS-type data structures can be demonstrated by the example of multiplication and division of complex numbers.
  • Page 286: Example 5-11 Multiplication Of Two Pair Of Single-Precision Complex Number

    instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using SSE3 instructions to perform division of complex numbers. Example 5-11 Multiplication of Two Pair of Single-precision Complex Number // Multiplication of // a + i b can be stored as a data structure movsldup xmm0, Src1;...
  • Page 287: Example 5-12 Division Of Two Pair Of Single-Precision Complex Number

    Example 5-12 Division of Two Pair of Single-precision Complex Number // Division of (ak + i bk ) / (ck + i dk ) movshdup xmm0, Src1; load imaginary parts into the movaps xmm1, src2; load the 2nd pair of complex values, mulps xmm0, xmm1;...
  • Page 288: Sse3 And Horizontal Computation

    SSE3 and Horizontal Computation Sometimes the AoS type of data organization is more natural in many algebraic formulas. SSE3 enhances the flexibility of SIMD programming for applications that rely on the horizontal computation model. SSE3 offers several instructions that are capable of horizontal arithmetic operations.
  • Page 289: Simd Optimizations And Microarchitectures

    SIMD Optimizations and Microarchitectures Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst microarchitecture. The following sub-section discusses optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.
  • Page 290 Packed horizontal SSE3 instructions (haddps and hsubps) can simplify the code sequence for some tasks. However, these instructions consist of more than five micro-ops on Intel Core Solo and Intel Core Duo processors. Care must be taken to ensure the latency and decoding penalty of the horizontal instruction does not offset any algorithmic benefits.
  • Page 291: Chapter 6 Optimizing Cache Usage

    Optimizing Cache Usage Over the past decade, processor speed has increased more than ten times, while memory access speed has increased at a slower pace. The resulting disparity has made it important to tune applications so that either (a) a majority of the data accesses are fulfilled from processor caches, or (b) memory latency is effectively masked to utilize peak memory bandwidth as much as possible.
  • Page 292: General Prefetch Coding Guidelines

    Examples of such Intel compiler intrinsics include _mm_prefetch; for details on these intrinsics, refer to the Intel® C++ Compiler User’s Guide, doc. number 718195. NOTE. In a number of cases presented in this chapter,...
  • Page 293 • Facilitate compiler optimization: — Minimize use of global variables and pointers — Minimize use of complex control flow — Use the modifier, avoid const — Choose data types carefully (see below) and avoid type casting. • Use cache blocking techniques (for example, strip mining): —...
  • Page 294: Hardware Prefetching Of Data

    Hardware Prefetching of Data The Pentium 4, Intel Xeon, Pentium M, Intel Core Solo and Intel Core Duo processors implement a hardware automatic data prefetcher which monitors application data access patterns and prefetches data automatically.
  • Page 295: Prefetch And Cacheability Instructions

    Data Reads for load streams. Other than the items 2 and 4 discussed above, most other characteristics also apply to Pentium M, Intel Core Solo and Intel Core Duo processors. The hardware prefetcher implemented in the Pentium M processor fetches data to a second level cache.
  • Page 296: Prefetch

    Data reference patterns can be classified as temporal, spatial, or non-temporal. These data characteristics are used in the discussions that follow. Prefetch This section discusses the mechanics of the software prefetch instructions. In general, software prefetch instructions should be used to supplement the practice of tuning an access pattern to suit the automatic hardware prefetch mechanism.
  • Page 297 instruction is implementation-specific; applications need prefetch to be tuned to each implementation to maximize performance. Using the NOTE. recommended only if data does not fit in cache. instructions merely provide a hint to the hardware, and prefetch they will not generate exceptions or faults except for a few special cases (see the “Prefetch and Load Instructions”...
  • Page 298: The Prefetch Instructions - Pentium 4 Processor Implementation

    The Prefetch Instructions – Pentium 4 Processor Implementation Streaming SIMD Extensions include four flavors of prefetch instructions, one non-temporal and three temporal. They correspond to two types of operations, temporal and non-temporal. The non-temporal instruction is prefetchnta; the temporal instructions are prefetcht0, prefetcht1, and prefetcht2.
  • Page 299: Cacheability Control

    Currently, the prefetch instruction provides greater performance than preloading because it: • has no destination register, it only updates cache lines. • does not stall the normal instruction retirement. • does not affect the functional behavior of the program. • has no cache line split accesses. •...
  • Page 300: The Non-Temporal Store Instructions

    The Non-temporal Store Instructions This section describes the behavior of streaming stores and reiterates some of the information presented in the previous section. In Streaming SIMD Extensions, the movntps, movntq and maskmovq instructions are streaming stores. With regard to memory characteristics and ordering, they are similar mostly to the Write-Combining (WC) memory type: •...
  • Page 301: Memory Type And Non-Temporal Stores

    (with semantics). Note that the approaches (separate or combined) can be different for future processors. The Pentium 4, Intel Core Solo and Intel Core Duo processors implement the latter policy (of ... ) or memory type range registers...
  • Page 302: Write-Combining

    evicting data from all processor caches). The Pentium M processor implements a combination of both approaches. If the streaming store hits a line that is present in the first-level cache, the store data is combined in place within the first-level cache.
  • Page 303: Streaming Store Usage Models

    possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility. Streaming Store Usage Models The two primary usage domains for streaming store are coherent requests and non-coherent requests. Coherent Requests Coherent requests are normal loads and stores to system memory, which may also hit cache lines present in another processor in a...
  • Page 304: Streaming Store Instruction Descriptions

    In case the region is not mapped as WC, the streaming store might update in-place in the cache, and a subsequent read could return the cached copy instead of the data being written to system memory. Explicitly mapping the region as WC in this case ensures that any data read from this region will not be placed in the processor’s caches; a streaming-store sketch follows.
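    A hedged sketch of a streaming (non-temporal) store loop (the names are assumptions; movntps is issued via _mm_stream_ps, and sfence makes the weakly-ordered stores globally visible before the data is consumed):

    #include <xmmintrin.h>

    /* Sketch: fill a buffer with non-temporal stores so the written lines
       do not displace useful data in the caches. dst must be 16-byte aligned. */
    static void fill_stream(float *dst, __m128 value, int count4)
    {
        for (int i = 0; i < count4; i++)
            _mm_stream_ps(dst + 4 * i, value);   /* movntps */
        _mm_sfence();                            /* order the streaming stores */
    }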
  • Page 305: The Fence Instructions

    The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register) instructions store data from a register to the location specified by the edi register. The most significant bit in each byte of the second mask register is used to selectively write the data of the first register on a per-byte basis.
  • Page 306: The Lfence Instruction

    The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a result, the sfence instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume this data. The sfence instruction provides a performance-efficient way of ensuring the ordering when every...
  • Page 307: The Clflush Instruction

    The clflush Instruction The cache line associated with the linear address specified by the value of byte address is invalidated from all levels of the processor cache hierarchy (data and instruction). The invalidation is broadcast throughout the coherence domain. If, at any level of the cache hierarchy, the line is inconsistent with memory (dirty) it is written to memory before invalidation.
  • Page 308: Memory Optimization Using Prefetch

    Example 6-1 Pseudo-code for Using clflush
    while (!buffer_ready) {}
    mfence
    for (i = 0; i < num_cachelines; i += cacheline_size) {
      clflush (char *)((unsigned int)buffer + i)
    }
    mfence
    prefnta buffer[0];
    VAR = buffer[0];
    Memory Optimization Using Prefetch The Pentium 4 processor has two mechanisms for data prefetch: software-controlled prefetch and an automatic hardware prefetch. Software-controlled Prefetch The software-controlled prefetch is enabled using the four prefetch instructions introduced with Streaming SIMD Extensions; a usage sketch follows.
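    A hedged sketch of issuing software prefetches ahead of a compute loop (the distance of two 64-byte lines, i.e. 128 bytes, is an assumption to be tuned per the scheduling-distance discussion later in this chapter):

    #include <xmmintrin.h>

    /* Sketch: prefetch data two cache lines ahead of the element being processed. */
    static float sum_with_prefetch(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            _mm_prefetch((const char *)(a + i) + 128, _MM_HINT_NTA);
            s += a[i];
        }
        return s;
    }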
  • Page 309: Hardware Prefetch

    Hardware Prefetch The automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream. This feature was introduced with the Pentium 4 processor. The characteristics of the hardware prefetching are as follows: •...
  • Page 310: Example Of Effective Latency Reduction With H/W Prefetch

    • May consume extra system bandwidth if the application’s memory traffic has significant portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-stride memory traffic). • Effectiveness with existing applications depends on the proportions of small-stride versus large-stride accesses in the application’s memory traffic.
  • Page 311: Example 6-2 Populating An Array For Circular Pointer Chasing With Constant Stride

    Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride
    register char **p;
    char *next;                // Populating pArray for circular pointer chasing with constant stride
    // p = (char **)*p;  loads a value pointing to the next load
    p = (char **)&pArray;
    for (i = 0;...
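    A compilable sketch of the population step follows; the element-count and stride parameters are placeholders, and the traversal comment shows the pointer-chasing load pattern described above.

    #include <stddef.h>

    /* Element 0 points to the element 'stride_elems' away, and so on, with the
       last element wrapping back to element 0 so the chain is circular. */
    void populate_ring(char **pArray, size_t n_elems, size_t stride_elems)
    {
        size_t prev = 0;
        for (size_t i = stride_elems; i < n_elems; i += stride_elems) {
            pArray[prev] = (char *)&pArray[i];   /* each load points to the next load */
            prev = i;
        }
        pArray[prev] = (char *)&pArray[0];       /* close the circular chain */
    }

    /* Traversal (pointer chasing):
           register char **p = (char **)&pArray[0];
           for (...) p = (char **)*p;            // each load yields the next address
    */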
  • Page 312: Example Of Latency Hiding With S/W Prefetch Instruction

    Figure 6-1 Effective Latency Reduction as a Function of Access Stride (plot of the upper bound of pointer-chasing latency reduction, in percent, versus access stride)
  • Page 313: Figure 6-2 Memory Access Latency And Execution Without Prefetch

    execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture. Figure 6-2 Memory Access Latency and Execution Without Prefetch (timeline: execution units idle while waiting on memory, memory pipeline idle during execution)...
  • Page 314: Software Prefetching Usage Checklist

    The performance loss caused by poor utilization of resources can be completely eliminated by scheduling the prefetch instructions appropriately. As shown in Figure 6-3, prefetch instructions are issued two vertex iterations ahead. This assumes that only one vertex gets processed per iteration and that a new data cache line is needed for each iteration.
  • Page 315: Software Prefetch Scheduling Distance

    • Balance single-pass versus multi-pass execution • Resolve memory bank conflict issues • Resolve cache management issues The subsequent sections discuss all the above items. Software Prefetch Scheduling Distance Determining the ideal prefetch placement in the code depends on many architectural parameters, including the amount of memory to be prefetched, cache lookup latency, system memory latency, and estimate of computation cycle.
  • Page 316: Software Prefetch Concatenation

    lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration. Example 6-3 Prefetch Scheduling Distance top_loop: prefetchnta [edx + esi + 128*3] prefetchnta [edx*4 + esi + 128*3] .
  • Page 317 This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effect can be removed by applying a technique called prefetch concatenation. With this technique, the memory access and execution can be fully pipelined and fully utilized.
  • Page 318: Example 6-5 Concatenation And Unrolling The Last Iteration Of Inner Loop

    Example 6-4 Using Prefetch Concatenation for (ii = 0; ii < 100; ii++) { for (jj = 0; jj < 32; jj+=8) { prefetch a[ii][jj+8] computation a[ii][jj] Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its associated outer loop. Simply by unrolling the last iteration out of the inner loop and specifying the effective prefetch address for data used in the following iteration, the performance loss of memory de-pipelining can be...
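    The concatenation idea just described can be sketched in C with the SSE prefetch intrinsic. This is a minimal illustration rather than the manual's own example: the array shape, the 8-element block size, and the compute() placeholder are all assumptions.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

    #define M 100
    #define N 32

    static float a[M][N];

    /* Placeholder for "computation a[ii][jj]". */
    static void compute(float *block)
    {
        for (int k = 0; k < 8; k++)
            block[k] *= 2.0f;
    }

    /* The last inner iteration is unrolled so its prefetch targets the first
       block of the next outer iteration, bridging the bubble at the loop boundary. */
    void process_with_concatenation(void)
    {
        for (int ii = 0; ii < M; ii++) {
            int jj;
            for (jj = 0; jj < N - 8; jj += 8) {
                _mm_prefetch((const char *)&a[ii][jj + 8], _MM_HINT_NTA);
                compute(&a[ii][jj]);
            }
            /* Unrolled last inner iteration: prefetch the next row's first block. */
            if (ii + 1 < M)
                _mm_prefetch((const char *)&a[ii + 1][0], _MM_HINT_NTA);
            compute(&a[ii][jj]);
        }
    }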
  • Page 319: Minimize Number Of Software Prefetches

    Minimize Number of Software Prefetches Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clocks and memory bandwidth. Excessive prefetching may lead to performance penalties because of issue penalties in the front-end of the machine and/or resource contention in the memory sub-system.
  • Page 320 Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop. The secondary Y axis indicates the percentage of bus bandwidth utilization.
  • Page 321: Figure 6-5 Memory Access Latency And Execution With Prefetch

    Figure 6-5 Memory Access Latency and Execution With Prefetch (two panels, "One load and one store stream" and "2 load streams, 1 store stream", plotting execution time and % bus utilization against computations per loop for prefetch distances 16_por, 32_por, 64_por, 128_por and None_por)
  • Page 322: Mix Software Prefetch With Computation Instructions

    Mix Software Prefetch with Computation Instructions It may seem convenient to cluster all of the prefetch instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation. In order to achieve best possible performance, prefetch instructions must be interspersed with other computational instructions in the instruction sequence rather than clustered together.
  • Page 323: Example 6-6 Spread Prefetch Instructions

    Example 6-6 Spread Prefetch Instructions
    top_loop:
    prefetchnta [ebx+128]
    prefetchnta [ebx+1128]
    prefetchnta [ebx+2128]
    prefetchnta [ebx+3128]
    . . . .
    prefetchnta [ebx+17128]
    prefetchnta [ebx+18128]
    prefetchnta [ebx+19128]
    prefetchnta [ebx+20128]
    movps xmm1, [ebx]
    addps xmm2, [ebx+3000]
    mulps xmm3, [ebx+4000]
    addps xmm1, [ebx+1000]
    addps xmm2, [ebx+3016]
    mulps xmm1, [ebx+2000]
    mulps xmm1, xmm2...
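    For compiled code, the same interleaving can be expressed with intrinsics. The following is a hedged sketch only; the prefetch distance, stride and arithmetic are illustrative values, not tuned ones, and the pointer is assumed 16-byte aligned.

    #include <xmmintrin.h>

    void spread_prefetch(float *p, int n)
    {
        for (int i = 0; i + 16 <= n; i += 16) {
            _mm_prefetch((const char *)(p + i + 256), _MM_HINT_NTA);  /* prefetch ahead   */
            __m128 x = _mm_load_ps(p + i);
            x = _mm_mul_ps(x, x);                                     /* some computation */
            _mm_prefetch((const char *)(p + i + 272), _MM_HINT_NTA);  /* next prefetch,   */
            __m128 y = _mm_load_ps(p + i + 4);                        /* spread between   */
            y = _mm_add_ps(y, x);                                     /* the arithmetic   */
            _mm_store_ps(p + i, y);
            /* ... remaining work for the iteration, with further prefetches mixed in */
        }
    }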
  • Page 324: Software Prefetch And Cache Blocking Techniques

    Software Prefetch and Cache Blocking Techniques Cache blocking techniques, such as strip-mining, are used to improve temporal locality, and thereby cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory. When two-dimensional arrays are used in programs, loop blocking technique (similar to strip-mining but in two dimensions) can be applied for a better memory performance.
  • Page 325: Figure 6-6 Cache Blocking - Temporally Adjacent And Non-Adjacent Passes

    Figure 6-6 Cache Blocking – Temporally Adjacent and Non-adjacent Passes (panels showing Dataset A and Dataset B across successive passes) In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation.
  • Page 326: Figure 6-7 Examples Of Prefetch And Strip-Mining For Temporally Adjacent And Non-Adjacent Passes Loops

    Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios. Figure 6-7 Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops Temporally adjacent passes For Pentium 4 processors, the left scenario shows a graphical implementation of using ways of the second-level cache only (SM1 denotes strip mine one way of second-level), minimizing second-level cache pollution.
  • Page 327: Example 6-7 Data Access Of A 3D Geometry Engine Without Strip-Mining

    In the scenario to the right in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache locality. Therefore, use prefetcht0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps a copy of the data in second-level cache, which reduces memory traffic and latencies for passes 3 and 4.
  • Page 328: Example 6-8 Data Access Of A 3D Geometry Engine With Strip-Mining

    Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during the transformation loop as well as bandwidth wasted in the lighting loop. Now consider the code in Example 6-8 where strip-mining has been incorporated into the loops.
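    A minimal C sketch of the strip-mined, two-pass structure described here is shown below; the vertex layout, strip size, prefetch distance and the transform/lighting placeholders are all assumptions for illustration, not the manual's Example 6-8.

    #include <xmmintrin.h>

    #define NVTX  4000
    #define STRIP 80                 /* strip sized so that one strip fits in cache */

    typedef struct { float x, y, z, nx, ny, nz, u, v; } Vertex;  /* illustrative layout */
    static Vertex v[NVTX + 4];       /* small tail margin so prefetches stay in bounds */

    static void transform(Vertex *p) { p->x += 1.0f; }   /* placeholder pass-1 kernel */
    static void lighting(Vertex *p)  { p->u += p->x; }   /* placeholder pass-2 kernel */

    /* Both passes walk the same strip back-to-back, so the lighting pass reuses
       vertex data while it is still cache-resident instead of re-fetching it. */
    void process_strip_mined(void)
    {
        for (int base = 0; base < NVTX; base += STRIP) {
            int end = (base + STRIP < NVTX) ? base + STRIP : NVTX;
            for (int i = base; i < end; i++) {           /* pass 1: transform */
                _mm_prefetch((const char *)&v[i + 4], _MM_HINT_NTA);
                transform(&v[i]);
            }
            for (int i = base; i < end; i++)             /* pass 2: lighting  */
                lighting(&v[i]);
        }
    }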
  • Page 329: Hardware Prefetching And Cache Blocking Techniques

    Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are: • Do strip-mining: partition loops so that the dataset fits into second-level cache. • Use prefetchnta if the data is only used once or the dataset fits into 32K (one way of second-level cache). Use prefetcht0 if the dataset exceeds 32K.
  • Page 330: Example 6-9 Using Hw Prefetch To Improve Read-Once Memory Traffic

    happen to be powers of 2, aliasing condition due to finite number of way-associativity (see “Capacity Limits and Aliasing in Caches” in Chapter 2) will exacerbate the likelihood of cache evictions. Example 6-9 Using HW Prefetch to Improve Read-Once Memory Traffic Un-optimized image transpose // dest and src represent two-dimensional arrays for( i = 0;i <...
  • Page 331: Single-Pass Versus Multi-Pass Execution

    references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references the linear addresses. Single-pass versus Multi-pass Execution An algorithm can use single- or multi-pass execution defined as follows: • Single-pass, or unlayered execution passes a single data element through an entire computation pipeline.
  • Page 332: Figure 6-8 Single-Pass Vs. Multi-Pass 3D Geometry Engines

    selected to ensure that the batch stays within the processor caches through all passes. An intermediate cached buffer is used to pass the batch of vertices from one stage or pass to the next one. Single-pass execution can be better suited to applications which limit the number of features that may be used at a given time.
  • Page 333: Memory Optimization Using Non-Temporal Stores

    The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation in overall execution time. In contrast, for a single-pass approach, bandwidth-limitations can be distributed/ amortized across other computation-intensive stages.
  • Page 334: Cache Management

    In addition, the Pentium 4 processor takes advantage of the Intel C++ Compiler that supports C++ language-level features for the Streaming SIMD Extensions. The Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization.
  • Page 335: Video Encoder

    The following examples of using prefetch instructions, in the operation of a video encoder and decoder as well as in a simple 8-byte memory copy, illustrate the performance gain from using prefetch instructions for efficient cache management. Video Encoder In a video encoder example, some of the data used during the encoding process is kept in the processor’s second-level cache, to minimize the number of reference streams that must be re-read from system memory.
  • Page 336: Conclusions From Video Encoder And Decoder Implementation

    Later, the processor re-reads the data using maximum bandwidth, yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch. Conclusions from Video Encoder and Decoder Implementation These two examples indicate that by using an appropriate combination of non-temporal prefetches and non-temporal stores, an application can be designed to lessen the overhead of memory transactions by preventing second-level cache pollution, keeping useful data in the...
  • Page 337: Tlb Priming

    The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations: • alignment of data • proper layout of pages in memory • cache size • interaction of the translation lookaside buffer (TLB) with memory accesses •...
  • Page 338: Using The 8-Byte Streaming Stores And Software Prefetch

    Using the 8-byte Streaming Stores and Software Prefetch Example 6-11 presents the copy algorithm that uses second level cache. The algorithm performs the following steps: Uses a blocking technique to transfer 8-byte data from memory into second-level cache using the _mm_prefetch intrinsic, 128 bytes at a time, to fill a block. The size of a block should be less than one half of the size of the second-level cache, but large enough to amortize the cost of the loop.
  • Page 339 Example 6-11 A Memory Copy Routine Using Software Prefetch // copy 128 byte per loop for (j=kk; j<kk+NUMPERPAGE; j+=16) { _mm_stream_ps((float*)&b[j], _mm_load_ps((float*)&a[j])); _mm_stream_ps((float*)&b[j+2], _mm_load_ps((float*)&a[j+2])); _mm_stream_ps((float*)&b[j+4], _mm_load_ps((float*)&a[j+4])); _mm_stream_ps((float*)&b[j+6], _mm_load_ps((float*)&a[j+6])); _mm_stream_ps((float*)&b[j+8], _mm_load_ps((float*)&a[j+8])); _mm_stream_ps((float*)&b[j+10], _mm_load_ps((float*)&a[j+10])); _mm_stream_ps((float*)&b[j+12], _mm_load_ps((float*)&a[j+12])); _mm_stream_ps((float*)&b[j+14], _mm_load_ps((float*)&a[j+14])); // finished copying one block // finished copying N elements _mm_sfence();...
  • Page 340: Using 16-Byte Streaming Stores And Hardware Prefetch

    A load from the array is issued first so that the page table entry for the array is entered in the TLB prior to prefetching. This load is essentially a prefetch itself, as a cache line is filled from that memory location with this instruction. Hence, the prefetching starts from kk+4 in this loop. This example assumes that the destination of the copy is not temporally adjacent to the code.
  • Page 341 Optimizing Cache Usage prefetch_loop: movaps xmm0, [esi+ecx] movaps xmm0, [esi+ecx+64] add ecx,128 cmp ecx,BLOCK_SIZE jne prefetch_loop xor ecx,ecx align 16 cpy_loop: movdqa xmm0,[esi+ecx] movdqa xmm1,[esi+ecx+16] movdqa xmm2,[esi+ecx+32] movdqa xmm3,[esi+ecx+48] movdqa xmm4,[esi+ecx+64] movdqa xmm5,[esi+ecx+16+64] movdqa xmm6,[esi+ecx+32+64] movdqa xmm7,[esi+ecx+48+64] movntdq [edi+ecx],xmm0 movntdq [edi+ecx+16],xmm1 movntdq [edi+ecx+32],xmm2 movntdq [edi+ecx+48],xmm3 movntdq [edi+ecx+64],xmm4...
  • Page 342: Performance Comparisons Of Memory Copy Routines

    Table 6-2 Relative Performance of Memory Copy Routines. Rows list processor, CPUID signature and FSB speed: Pentium M processor, 0x6Dn, 400; Intel Core Solo and Intel Core Duo processors, 0x6En; Pentium D processor, 0xF4n, 800...
  • Page 343: Deterministic Cache Parameters

    If CPUID supports the function leaf with input EAX = 4, this is referred to as the deterministic cache parameter leaf of CPUID (see the CPUID instruction in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2A). Software can use the deterministic cache parameter leaf to...
  • Page 344: Table 6-3 Deterministic Cache Parameters Leaf

    query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (starting from 0) in the ECX register. The list of parameters is shown in Table 6-3. Table 6-3 Deterministic Cache Parameters Leaf Bit Location EAX[4:0] EAX[7:5]...
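    As a sketch of how software might walk this leaf, the following uses the GCC/Clang __get_cpuid_count helper (a toolchain assumption, not part of the manual) and decodes fields at the Table 6-3 positions to compute each cache's size.

    #include <stdio.h>
    #include <cpuid.h>    /* __get_cpuid_count -- GCC/Clang specific */

    int main(void)
    {
        for (unsigned int index = 0; ; index++) {
            unsigned int eax, ebx, ecx, edx;
            if (!__get_cpuid_count(4, index, &eax, &ebx, &ecx, &edx))
                break;                                   /* leaf 4 not supported         */
            unsigned int type = eax & 0x1f;              /* EAX[4:0], 0 = no more caches */
            if (type == 0)
                break;
            unsigned int level      = (eax >> 5) & 0x7;              /* EAX[7:5]       */
            unsigned int line_size  = (ebx & 0xfff) + 1;             /* EBX[11:0] + 1  */
            unsigned int partitions = ((ebx >> 12) & 0x3ff) + 1;     /* EBX[21:12] + 1 */
            unsigned int ways       = ((ebx >> 22) & 0x3ff) + 1;     /* EBX[31:22] + 1 */
            unsigned int sets       = ecx + 1;                       /* ECX + 1        */
            printf("L%u cache, type %u: %u bytes\n",
                   level, type, ways * partitions * line_size * sets);
        }
        return 0;
    }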
  • Page 345: Cache Sharing Using Deterministic Cache Parameters

    • Determine multi-threading resource topology in an MP system (See Section 7.10 of IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A). • Determine cache hierarchy topology in a platform using multi-core processors (See Example 7-13). • Manage threads and processor affinities.
  • Page 346: Determine Prefetch Stride Using Deterministic Cache Parameters

    platform, software can extract information on the number and the identities of each logical processor sharing that cache level; this information is made available to applications by the OS. This is discussed in detail in “Using Shared Execution Resources in a Processor Core” in Chapter 7 and Example 7-13.
  • Page 347: Multi-Core And Hyper-Threading Technology

    The number of logical processors present in each package can also be obtained from CPUID. The application must check how many logical processors are enabled and made available to application at runtime by making the appropriate operating system calls. See the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2A for more information.
  • Page 348: Performance And Usage Models

    cores but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This chapter covers guidelines that apply to either situation. This chapter covers • Performance characteristics and usage models, • Programming models for multithreaded applications, •...
  • Page 349: Figure 7-1 Amdahl's Law And Mp Speed-Up

    Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law. The bar in Figure 7-1 represents an individual task unit or the collective workload of an entire application. In general, the speed-up of running multiple threads on an MP system with N physical processors, over single-threaded execution, can be expressed as: RelativeResponse...
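    The truncated relation above can be restated in one common form of Amdahl's law; here P is the fraction of the workload that executes in parallel, N is the number of physical processors, and the explicit overhead term O for synchronization and thread management is an added assumption of this sketch.

    \mathrm{RelativeResponse} \;=\; \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}}
      \;=\; \frac{(1 - P) + P}{(1 - P) + \dfrac{P}{N} + O}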
  • Page 350: Multitasking Environment

    When optimizing application performance in a multithreaded environment, control flow parallelism is likely to have the largest impact on performance scaling with respect to the number of physical processors and to the number of logical processors per physical processor. If the control flow of a multi-threaded application contains a workload in which only 50% can be executed in parallel, the maximum performance gain using two physical processors is only 33%, compared to using a single processor.
  • Page 351 terms of time of completion relative to the same task when in a single-threaded environment) will vary, depending on how much shared execution resources and memory are utilized. For development purposes, several popular operating systems (for example, Microsoft Windows* XP Professional and Home, and Linux* distributions using kernel 2.4.19 or later) can manage the task scheduling and the balancing of shared execution resources within each physical processor to maximize the throughput.
  • Page 352: Programming Models And Multithreading

    When two applications are employed as part of a multi-tasking workload, there is little synchronization overhead between these two processes. It is also important to ensure each application has minimal synchronization overhead within itself. An application that uses lengthy spin loops for intra-process synchronization is less likely to benefit from Hyper-Threading Technology in a multi-tasking workload.
  • Page 353: Parallel Programming Models

    Parallel Programming Models Two common programming models for transforming independent task requirements into application threads are: • domain decomposition • functional decomposition Domain Decomposition Usually large compute-intensive tasks use data sets that can be divided into a number of small subsets, each having a large degree of computational independence.
  • Page 354: Functional Decomposition

    IA-32 processor supporting Hyper-Threading Technology. Specialized Programming Models Intel Core Duo processor offers a second-level cache shared by two processor cores in the same physical package. This provides opportunities for two application threads to access some application data while minimizing the overhead of bus traffic.
  • Page 355: Example 7-1 Serial Execution Of Producer And Consumer Work Items

    overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with the number of cores, the synchronization overhead must be kept low. This can be done by ensuring the producer and consumer threads have comparable time constants for completing each incremental task prior to exchanging buffers.
  • Page 356: Producer-Consumer Threading Models

    The gap between each task represents synchronization overhead. The decimal number in parentheses represents a buffer index. On an Intel Core Duo processor, the producer thread can store data in the second-level cache to allow the consumer thread to continue work requiring minimal bus traffic.
  • Page 357: Example 7-2 Basic Structure Of Implementing Producer Consumer Threads

    Example 7-2 Basic Structure of Implementing Producer Consumer Threads (a) Basic structure of a producer thread function void producer_thread() int iter_num = workamount - 1; // make local copy int mode1 = 1; produce(buffs[0],count); // placeholder function while (iter_num--) { Signal(&signal1,1);...
  • Page 358: Figure 7-4 Interlaced Variation Of The Producer Consumer Model

    corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in two threads. As long as the data generated by the producer reside in either the first or second level cache of the same core, the consumer can access them without incurring bus traffic.
  • Page 359: Example 7-3 Thread Function For An Interlaced Producer Consumer Model

    Example 7-3 Thread Function for an Interlaced Producer Consumer Model // master thread starts the first iteration, the other thread must wait // one iteration void producer_consumer_thread(int master) int mode = 1 - master; // track which thread and its designated buffer index unsigned int iter_num = workamount >>...
  • Page 360: Tools For Creating Multithreaded Applications

    (API) is not the only method for creating multithreaded applications. New tools such as the Intel Thread Checker and Thread Profiler are available with capabilities that make the challenge of creating multithreaded applications easier. Two features available in the latest Intel Compilers are: • generating multithreaded code using OpenMP* directives •...
  • Page 361 Thread Profiler. Thread Profiler is a plug-in data collector for the Intel VTune Performance Analyzer. Use it to analyze threading performance and identify parallel performance bottlenecks. It graphically illustrates what each thread is doing at various levels of detail using a hierarchical summary.
  • Page 362: Optimization Guidelines

    Optimization Guidelines This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance): • thread synchronization • bus utilization • memory optimization • front end optimization • execution resource optimization Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow.
  • Page 363: Key Practices Of System Bus Optimization

    • Place each synchronization variable alone, separated by 128 bytes or in a separate cache line. See “Thread Synchronization” for more details. Key Practices of System Bus Optimization Managing bus traffic can significantly impact the overall performance of multithreaded software and MP systems. Key practices of system bus optimization for achieving high data throughput and quick response are: •...
  • Page 364: Key Practices Of Front-End Optimization

    • Adjust the private stack of each thread in an application so the spacing between these stacks is not offset by multiples of 64 KB or 1 MB (prevents unnecessary cache line evictions) when targeting IA-32 processors supporting Hyper-Threading Technology. •...
  • Page 365: Generality And Performance Impact

    • For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated threads to increase the hardware resource utilization of each physical processor package. See “Using Thread Affinities to Manage Shared Platform Resources” for more details. Generality and Performance Impact The next five sections cover the optimization techniques in detail. Recommendations discussed in each section are ranked by importance in terms of estimated local impact and generality.
  • Page 366: Choice Of Synchronization Primitives

    The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s requirements for synchronization. Intel Thread Profiler can be used to profile the execution timeline of each thread and detect situations where performance is impacted by frequent occurrences of synchronization overhead.
  • Page 367: Table 7-1 Properties Of Synchronization Objects

    Profiler can be very useful in dealing with multi-threading functional correctness issues and performance impact under multi-threaded execution. Additional information on the capabilities of Intel Thread Checker and Thread Profiler is described in Appendix A. Table 7-1 is useful for comparing the properties of three categories of synchronization objects available to multi-threaded applications.
  • Page 368: Synchronization For Short Periods

    Table 7-1 Properties of Synchronization Objects (Contd.) Operating System Synchronization Characteristics Objects Miscellaneous Some objects provide intra-process synchronization and some are for inter-process communication Recommended 1. # of active threads use conditions > # of cores. 2. Waiting thousands of cycles for a signal.
  • Page 369 This penalty occurs on the Pentium M processor, the Intel Core Solo and Intel Core Duo processors. However, the penalty on these processors is small compared with penalties suffered on the Pentium 4 and Intel Xeon processors.
  • Page 370: Example 7-4 Spin-Wait Loop And Pause Instructions

    Example 7-4 Spin-wait Loop and PAUSE Instructions (a) An un-optimized spin-wait loop experiences performance penalty when exiting the loop. It consumes execution resources without contributing computational work. do { // This loop can run faster than the speed of memory access, // other worker threads cannot finish modifying sync_var until // outstanding loads from the spinning loops are resolved.
  • Page 371: Optimization With Spin-Locks

    The optimized spin-wait loop using the PAUSE instruction is shown in Example 7-4(b). The PAUSE instruction is compatible with all IA-32 processors. On IA-32 processors prior to the Intel NetBurst microarchitecture, the PAUSE instruction is essentially a nop instruction. Additional examples of optimizing spin-wait loops using the PAUSE instruction are available in Application Note AP-949 “Using...
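    A minimal C sketch of such a PAUSE-based spin-wait, using the _mm_pause intrinsic, is shown below; the variable name and exit condition are placeholders rather than the manual's Example 7-4(b).

    #include <emmintrin.h>   /* _mm_pause */

    static volatile int sync_var;

    void spin_wait_for(int expected)
    {
        while (sync_var != expected)
            _mm_pause();     /* de-pipelines the loop; behaves like a nop on older IA-32 CPUs */
    }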
  • Page 372: Synchronization For Longer Periods

    To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acquire the same lock. Apply a software pipelining technique to handle data that must be shared between multiple threads. Instead of allowing multiple threads to compete for a given lock, no more than two threads should have write access to a given lock.
  • Page 373 If an application thread must remain idle for a long time, the application should use a thread blocking API or other method to release the idle processor. The techniques discussed here apply to traditional MP system, but they have an even higher impact on IA-32 processors that support Hyper-Threading Technology.
  • Page 374: Avoid Coding Pitfalls In Thread Synchronization

    Avoid Coding Pitfalls in Thread Synchronization Synchronization between multiple threads must be designed and implemented with care to achieve good performance scaling with respect to the number of discrete processors and the number of logical processors per physical processor. No single technique is a universal solution for every synchronization situation.
  • Page 375: Example 7-5 Coding Pitfall Using Spin Wait Loop

    Example 7-5 Coding Pitfall using Spin Wait Loop (a) A spin-wait loop attempts to release the processor incorrectly. It experiences a performance penalty if the only worker thread and the control thread run on the same physical processor package. // Only one worker thread is running, // the control loop waits for the worker thread to complete.
  • Page 376: Prevent Sharing Of Modified Data And False-Sharing

    Prevent Sharing of Modified Data and False-Sharing On an Intel Core Duo processor, sharing of modified data incurs a performance penalty when a thread running on one core tries to read or write data that is currently present in modified state in the first level cache of the other core.
  • Page 377: Placement Of Shared Synchronization Variable

    User/Source Coding Rule 24. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors).
  • Page 378: Example 7-6 Placement Of Synchronization And Regular Variables

    • Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used locally by one thread are allocated in a manner to prevent sharing the cache line with other threads. Example 7-6 Placement of Synchronization and Regular Variables regVar;...
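    A hedged C sketch of this placement guidance follows; the 128-byte figure matches the sector size cited above, while the alignment macro, struct and variable names are assumptions, not the manual's Example 7-6.

    #if defined(_MSC_VER)
    #define CACHE_ALIGN __declspec(align(128))
    #else
    #define CACHE_ALIGN __attribute__((aligned(128)))   /* GCC/Clang */
    #endif

    /* Each synchronization variable gets its own 128-byte region, so regular
       data and other locks never share its cache line or sector. */
    typedef struct CACHE_ALIGN {
        volatile long lock;               /* the synchronization variable      */
        char pad[128 - sizeof(long)];     /* fill out the 128-byte sector      */
    } padded_lock_t;

    padded_lock_t lockA;                  /* occupies its own aligned region   */
    padded_lock_t lockB;                  /* separate from lockA and regVar    */
    long regVar;                          /* regular data kept off the lock lines */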
  • Page 379: System Bus Optimization

    • In managed environments that provide automatic object allocation, the object allocators and garbage collectors are responsible for layout of the objects in memory so that false sharing through two objects does not happen. • Provide classes such that only one thread writes to each object field and close object fields, in order to avoid false sharing.
  • Page 380: Conserve Bus Bandwidth

    Conserve Bus Bandwidth In a multi-threading environment, bus bandwidth may be shared by memory traffic originated from multiple bus agents (These agents can be several logical processors and/or several processor cores). Preserving the bus bandwidth can improve processor scaling performance. Also, effective bus bandwidth typically will decrease if there are significant large-stride cache-misses.
  • Page 381: Understand The Bus And Cache Interactions

    Be careful when parallelizing code sections with data sets that result in the total working set exceeding the second-level cache and/or consumed bandwidth exceeding the capacity of the bus. On an Intel Core Duo processor, if only one thread is using the second-level cache...
  • Page 382: Avoid Excessive Software Prefetches

    Avoid Excessive Software Prefetches Pentium 4 and Intel Xeon Processors have an automatic hardware prefetcher. It can bring data and instructions into the unified second-level cache based on prior reference patterns. In most situations, the hardware prefetcher is likely to reduce system memory latency without explicit intervention from software prefetches.
  • Page 383: Use Full Write Transactions To Achieve Higher Data Rate

    latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to overlap multiple outstanding memory read transactions. The average latency of back-to-back bus reads is likely to be lower than the average latency of scattered reads interspersed with other bus transactions.
  • Page 384: Memory Optimization

    Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining technique to separate WC store operations from competing with WB store traffic. To implement software write-combining, uncacheable writes to memory with the WC attribute are written to a small, temporary buffer (WB type) that fits in the first level data cache.
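    A sketch of this software write-combining technique is shown below. The chunk size, the streaming-store flush, the 16-byte alignment of the destination and a length that is a multiple of 64 are assumptions of this sketch, as is wc_dest standing for a WC-mapped region.

    #include <emmintrin.h>   /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
    #include <string.h>

    #define WC_CHUNK 64                       /* one cache line at a time */

    void wc_write(char *wc_dest, const char *src, size_t len)
    {
        char scratch[WC_CHUNK];               /* small WB buffer that stays in the L1 data cache */
        for (size_t off = 0; off < len; off += WC_CHUNK) {
            memcpy(scratch, src + off, WC_CHUNK);          /* partial writes hit the WB buffer */
            for (size_t i = 0; i < WC_CHUNK; i += 16) {
                __m128i v = _mm_loadu_si128((const __m128i *)(scratch + i));
                _mm_stream_si128((__m128i *)(wc_dest + off + i), v);  /* full-line WC bursts */
            }
        }
        _mm_sfence();                         /* order the streaming stores */
    }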
  • Page 385: Shared-Memory Optimization

    block size for loop blocking should be determined by dividing the target cache size by the number of logical processors available in a physical processor package. Typically, some cache lines are needed to access data that are not part of the source or destination buffers used in cache blocking, so the block size can be chosen between one quarter to one half of the target cache (see also, Chapter 3).
  • Page 386: Batched Producer-Consumer Model

    Figure 7-5, is to minimize bus traffic while sharing data between the producer and the consumer using a shared second-level cache. On an Intel Core Duo processor, when the work buffers are small enough to fit within the first-level cache, re-ordering of producer and consumer tasks is necessary to achieve optimal performance.
  • Page 387: Example 7-8 Batched Implementation Of The Producer Consumer Threads

    Example 7-8 shows the batched implementation of the producer and consumer thread functions. Example 7-8 Batched Implementation of the Producer Consumer Threads void producer_thread() int iter_num = workamount - batchsize; int mode1; for (mode1=0; mode1 < batchsize; mode1++) produce(buffs[mode1],count); } while (iter_num--) Signal(&signal1,1);...
  • Page 388: Eliminate 64-Kbyte Aliased Data Accesses

    Pentium 4 processor performance monitoring events. Appendix B includes an updated list of Pentium 4 processor performance metrics. These metrics are based on events accessed using the Intel VTune performance analyzer. Performance penalties associated with 64 KB aliasing are applicable mainly to current processor implementations of Hyper-Threading Technology or Intel NetBurst microarchitecture.
  • Page 389: Preventing Excessive Evictions In First-Level Data Cache

    Preventing Excessive Evictions in First-Level Data Cache Cached data in a first-level data cache are indexed to linear addresses but physically tagged. Data in second-level and third-level caches are tagged and indexed to physical addresses. While two logical processors in the same physical processor package execute in separate linear address space, the same processors can reference data at the same linear address in two address spaces but mapped to different physical addresses.
  • Page 390: Per-Thread Stack Offset

    (when using IA-32 processors supporting Hyper-Threading Technology). For parallel applications written to run with OpenMP, the OpenMP runtime library in Intel KAP/Pro Toolset automatically provides the stack offset adjustment for each thread. 7-44 Example 7-9 shows a code fragment...
  • Page 391: Example 7-9 Adding An Offset To The Stack Pointer Of Three Threads

    Example 7-9 Adding an Offset to the Stack Pointer of Three Threads Void Func_thread_entry(DWORD *pArg) {DWORD StackOffset = *pArg; DWORD var1; // The local variable at this scope may not benefit DWORD var2; // from the adjustment of the stack pointer that ensue. // Call runtime library routine to offset stack pointer.
  • Page 392: Per-Instance Stack Offset

    Example 7-9 Adding an Offset to the Stack Pointer of Three Threads (Contd.) { DWORD Stack_offset, ID_Thread1, ID_Thread2, ID_Thread3; Stack_offset = 1024; // Stack offset between parent thread and the first child thread. ID_Thread1 = CreateThread(Func_thread_entry, &Stack_offset); // Call OS thread API. Stack_offset = 2048;...
  • Page 393: Example 7-10 Adding A Pseudo-Random Offset To The Stack Pointer In The Entry Function

    However, the buffer space does enable the first-level data cache to be shared cooperatively when two copies of the same application are executing on the two logical processors in a physical processor package. To establish a suitable stack offset for two instances of the same application running on two logical processors in the same physical processor package, the stack pointer can be adjusted in the entry function of the application using the technique shown in Example 7-10.
  • Page 394: Front-End Optimization

    For dual-core processors where the second-level unified cache is shared by two processor cores (e.g. Intel Core Duo processor), multi-threaded software should consider the increase in code working set due to two threads fetching code from the unified cache as part of front-end and cache optimization.
  • Page 395: Optimization For Code Size

    On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache’s ability to deliver high bandwidth μop streams to the execution engine. Optimization for Code Size When the Trace Cache is continuously and repeatedly delivering μop traces that are pre-built, the scheduler in the execution engine can dispatch μops for execution at a high rate and maximize the utilization of available execution resources.
  • Page 396 APIC_ID (See Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for more details) associated with a logical processor. The three levels are: • physical processor package. A PACKAGE_ID label can be used to distinguish different physical packages within a cluster.
  • Page 397: Example 7-11 Assembling 3-Level Ids, Affinity Masks For Each Logical Processor

    Affinity masks can be used to optimize shared multi-threading resources. Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical Processor // The BIOS and/or OS may limit the number of logical processors // available to applications after system boot. // The below algorithm will compute topology for the logical processors // visible to the thread that is computing it.
  • Page 398 Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical Processor (Contd.) if (ThreadAffinityMask & SystemAffinity){ Set thread to run on the processor specified in ThreadAffinityMask. Wait if necessary and ensure thread is running on specified processor. apic_conf[ProcessorNum].initialAPIC_ID = GetInitialAPIC_ID(); Extract the Package, Core and SMT ID as explained in three level extraction algorithm.
  • Page 399 Multi-Core and Hyper-Threading Technology first to the primary logical processor of each processor core. This example is also optimized to the situations of scheduling two memory-intensive threads to run on separate cores and scheduling two compute-intensive threads on separate cores. User/Source Coding Rule 39.
  • Page 400: Example 7-12 Assembling A Look Up Table To Manage Affinity Masks And Schedule Threads To Each Core First

    Example 7-12 Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First AFFINITYMASK LuT[64]; // A Lookup table to retrieve the affinity // mask we want to use from the thread // scheduling sequence index. int index =0;...
  • Page 401: Example 7-13 Discovering The Affinity Masks For Sibling Logical Processors Sharing The Same Cache

    Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache // Logical processors sharing the same cache can be determined by bucketing // the logical processors with a mask, the width of the mask is determined // from the maximum number of logical processors sharing that cache level.
  • Page 402 Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache (Contd.) PackageID[ProcessorNUM] = PACKAGE_ID; CoreID[ProcessorNum] = CORE_ID; SmtID[ProcessorNum] = SMT_ID; CacheID[ProcessorNUM] = CACHE_ID; // Only the target cache is stored in this example ProcessorNum++; ThreadAffinityMask <<= 1; NumStartedLPs = ProcessorNum;...
  • Page 403 Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache (Contd.) For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) { ProcessorMask << = 1; For (i = 0; i < CacheNum; i++) { // We may be comparing bit-fields of logical processors // residing in a different modular boundary of the cache // topology, the code below assume symmetry across this // modular boundary.
  • Page 404 Processor topology and an algorithm for software to identify the processor topology are discussed in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A. Typically the bus system is shared by multiple agents at the SMT level and at the processor core level of the processor topology.
  • Page 405: Using Shared Execution Resources In A Processor Core

    Such performance metrics are described in Appendix B and can be accessed using the Intel VTune Performance Analyzer. An event ratio like non-halted cycles per instructions retired (non-halted CPI) and non-sleep CPI can be useful in directing code-tuning efforts.
  • Page 406 Non-halted CPI can correlate to the resource utilization of an application thread, if the application thread is affinitized to a fixed logical processor. 10. In current implementations of processors based on Intel NetBurst microarchitecture, the theoretical lower bound for either non-halted CPI or non-sleep CPI is 1/3. Practical applications rarely achieve any value close to the lower bound.
  • Page 407 Multi-Core and Hyper-Threading Technology Using a function decomposition threading model, a multithreaded application can pair up a thread with critical dependence on a low-throughput resource with other threads that do not have the same dependency. User/Source Coding Rule 40. (M impact, L generality) If a single thread consumes half of the peak bandwidth of a specific execution unit (e.g.
  • Page 408 IA-32 Intel® Architecture Optimization Write-combining buffers are another example of execution resources shared between two logical processors. With two threads running simultaneously on a processor supporting Hyper-Threading Technology, s of both threads count toward the limit of four write write-combining buffers. For example: if an inner loop that writes to three separate areas of memory per iteration is run by two threads simultaneously, the total number of cache lines written could be six.
  • Page 409: 64-Bit Mode Coding Guidelines

    64-bit Mode Coding Guidelines Introduction This chapter describes coding guidelines for application software written to run in 64-bit mode. These guidelines should be considered as an addendum to the coding guidelines described in Chapter 2 through 7. Software that runs in either compatibility mode or legacy non-64-bit modes should follow the guidelines described in Chapter 2 through 7.
  • Page 410: Use Extra Registers To Reduce Register Pressure

    This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, EDI. To access the data in registers r9-r15, the REX prefix is required. Using the 32-bit form there does not reduce code size. Assembly/Compiler Coding rule Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional...
  • Page 411: Sign Extension To Full 64-Bits

    If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compiler should generate the multiply instruction that produces a 64-bit result. If the compiler or assembly programmer cannot determine that the result will be less than 64 bits, then a multiply that produces a 128-bit result is necessary.
  • Page 412: Alternate Coding Rules For 64-Bit Mode

    Can be replaced with: movsx r8, r9w movsx r8, r10b In the above example, the moves to r8w and r8b both require a merge to preserve the rest of the bits in the register. There is an implicit real dependency on r8 between the 'mov r8w, r9w' and 'mov r8b, r10b'. Using movsx breaks the real dependency and leaves only the output dependency, which the processor can eliminate through renaming.
  • Page 413 IMUL RAX, RCX The 64-bit version above is more efficient than using the following 32-bit version: MOV EAX, DWORD PTR[X] MOV ECX, DWORD PTR[Y] IMUL ECX In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single 64-bit register.
  • Page 414: Use 32-Bit Versions Of Cvtsi2Ss And Cvtsi2Sd When Possible

    Use the 32-bit versions of CVTSI2SS and CVTSI2SD when possible. Using Software Prefetch Intel recommends that software developers follow the recommendations in Chapter 2 and Chapter 6 when considering the choice of organizing data access patterns to take advantage of the hardware prefetcher (versus using software prefetch).
  • Page 415: Chapter 9 Power Optimization For Mobile Usages

    P-states to facilitate management of active power consumption, and several C-state types to facilitate management of static power consumption. Power saving techniques applicable to mobile platforms, such as Intel Centrino mobile technology or Intel Centrino Duo mobile technology, are rich subjects; only processor-related techniques are covered in this manual.
  • Page 416: Mobile Usage Scenarios

    Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction of active power and static power consumption. These include: • Enhanced Intel SpeedStep® Technology, which enables the operating system (OS) to program a processor to transition to lower frequency and/or voltage levels while executing a workload.
  • Page 417: Figure 9-1 Performance History And State Transitions

    to accommodate demand and adapt power consumption. The interaction between the OS power management policy and performance history is described below: Demand is high and the processor works at its highest possible frequency (P0). Demand decreases, which the OS recognizes after some delay; the OS sets the processor to a lower frequency (P1).
  • Page 418: Acpi C-States

    ACPI C-States When computational demands are less than 100%, part of the time the processor is doing useful work and the rest of the time it is idle. For example, the processor could be waiting on an application time-out set by a Sleep() function, waiting for a web server response, or waiting for a user mouse click.
  • Page 419 The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower power consumption. They also require more time to wake up (higher exit latency). C-state types are described below: • C0: The processor is active and performing computations and executing instructions.
  • Page 420: Processor-Specific C4 And Deep C4 States

    L2 cache to maintain its state. The Pentium M processor can be detected by CPUID signature with family 6, model 9 or 13; the Intel Core Solo and Intel Core Duo processors have CPUID signature with family 6, model 14.
  • Page 421: Guidelines For Extending Battery Life

    • In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter into a Deep C4 state to save additional static power. The processor reduces voltage to the minimum level required to safely maintain processor context.
  • Page 422: Adjust Performance To Meet Quality Of Features

    Adjust Performance to Meet Quality of Features When a system is battery powered, applications can extend battery life by reducing the performance or quality of features, turning off background activities, or both. Implementing such options in an application increases the processor idle time. Processor power consumption when idle is significantly lower than when active, resulting in longer battery life.
  • Page 423: Reducing Amount Of Work

    PeekMessage(). Use WaitMessage() to suspend the thread until a message is in the queue. The Intel® Mobile Platform Software Development Kit provides a set of APIs for mobile software to manage and optimize power consumption of the mobile processor and other components in the platform.
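    As a sketch of the PeekMessage/WaitMessage guidance above (Win32 API; the loop structure is illustrative, not taken from the SDK), the thread blocks until a message arrives instead of polling, letting the processor idle.

    #include <windows.h>

    void message_loop(void)
    {
        MSG msg;
        for (;;) {
            WaitMessage();                                     /* sleep until the queue is non-empty */
            while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
                if (msg.message == WM_QUIT)
                    return;
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            }
        }
    }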
  • Page 424: Platform-Level Optimizations

    (usually that equates to reducing the number of instructions that the processor needs to execute, or optimizing application performance). Optimizing an application starts with having efficient algorithms and then improving them using Intel software development tools, such as the Intel® VTune™ Performance Analyzers and the Intel® Performance Libraries.
  • Page 425: Handling Sleep State Transitions

    disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk access if it is not spinning. Handling Sleep State Transitions In some cases, transitioning to a sleep state may harm an application. For example, suppose an application is in the middle of using a file on the network when the system enters suspend mode.
  • Page 426: Using Enhanced Intel Speedstep ® Technology

    Using Enhanced Intel SpeedStep Use Enhanced Intel SpeedStep Technology to adjust the processor to operate at a lower frequency and save energy. The basic idea is to divide computations into smaller pieces and use OS power management policy to effect a transition to higher P-states.
  • Page 427 Power Optimization for Mobile Usages The same application can be written in such a way that work units are divided into smaller granularity, but scheduling of each work unit and Sleep() occurring at more frequent intervals (e.g. 100 ms) to deliver the same QOS (operating at full performance 50% of the time).
  • Page 428: Enabling Intel ® Enhanced Deeper Sleep

    Instead, use longer idle periods to allow the processor to enter a deeper low power mode. Enabling Intel® Enhanced Deeper Sleep In typical mobile computing usages, the processor is idle most of the time.
  • Page 429: Multi-Core Considerations

    C-state type. The lower-numbered state type is usually C2, but may even be C0. The situation is significantly improved in the Intel Core Solo processor (compared to previous generations of the Pentium M processors), but polling will likely prevent the processor from entering into the highest-numbered, processor-specific C-state.
  • Page 430: Thread Migration Considerations

    thread enables the physical processor to operate at a lower frequency relative to a single-threaded version. This in turn enables the processor to operate at a lower voltage, saving battery life. Note that the OS views each logical processor or core in a physical processor as a separate entity and computes CPU utilization independently for each logical processor or core.
  • Page 431: Multi-Core Considerations For C-States

    demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by such multi-core unaware P-state coordination, resulting in a performance anomaly. See Figure 9-5: Figure 9-5 Thread Migration in a Multi-Core Processor active Core 1 Idle active Core 2...
  • Page 432: Figure 9-6 Progression To Deeper Sleep

    (Figure 9-6 shows Thread 2 (core 2) alternating between sleep and active states.) 2. Enabling both cores to take advantage of Intel Enhanced Deeper Sleep: To best utilize a processor-specific C-state (e.g., Intel Enhanced Deeper Sleep) to conserve battery life in multithreaded applications, a multithreaded application should synchronize threads to work simultaneously and sleep simultaneously using OS synchronization primitives.
  • Page 433 Intel Core Duo processor provides an event for this purpose. The event (Serial_Execution_Cycle) increments under the following conditions: — The core is actively executing code in C0 state, — The second core in the physical processor is in an idle state (C1-C4).
  • Page 434 IA-32 Intel® Architecture Optimization 9-20...
  • Page 435: Appendix A Application Performance Tools

    Application Performance Tools Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture (IA)-based processors. This appendix introduces these tools and explains their capabilities for developing the most efficient programs without having to write assembly code.
  • Page 436: Intel ® Compilers

    Microsoft .NET IDE. In Linux environment, the Intel C++ Compilers are binary compatible with the corresponding version of gcc. The Intel C++ compiler may be used from the Borland* IDE, or standalone, like the Fortran compiler. All compilers allow you to optimize your code by using special optimization options described in this section.
  • Page 437: Code Optimization Options

    Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP parallelism are all supported by the Intel compilers and can significantly aid the performance of an application. The most general optimization options are -O1, -O2 and -O3. Each of them enables a number of specific optimization options. In most cases,...
  • Page 438: Automatic Processor Dispatch Support (-Qx[Extensions] And -Qax[Extensions

    Code produced will run on any Intel architecture 32-bit processor, but will be optimized specifically for the targeted processor. Automatic Processor Dispatch Support (-Qx[extensions] and -Qax[extensions]) The -Qx[extensions] and -Qax[extensions] options provide support to generate code that is specific to processor-instruction extensions. The corresponding options on Linux are -x[extensions] and -ax[extensions].
  • Page 439: Vectorizer Switch Options

    Vectorizer Switch Options The Intel C++ and Fortran Compilers can vectorize your code using the vectorizer switch options. The options that enable the vectorizer are -Qx[M,K,W,B,P] and -Qax[M,K,W,B,P]. The compiler provides a number of other vectorizer switch options that allow you to control vectorization.
  • Page 440: Multithreading With Openmp

    Multithreading with OpenMP* Both the Intel C++ and Fortran Compilers support shared memory parallelism via OpenMP compiler directives, library functions and environment variables. OpenMP directives are activated by the compiler switch -Qopenmp. For details, see the User's Guides available with the Intel C++ and Fortran Compilers.
  • Page 441: Interprocedural And Profile-Guided Optimizations

    Profile-guided optimization is particularly beneficial for the Pentium 4 and Intel Xeon processor family. It greatly enhances the optimization decisions the compiler makes regarding instruction cache utilization and memory paging. Also, because PGO uses execution-time information to...
  • Page 442: Intel ® Vtune™ Performance Analyzer

    Repeat the instrumentation compilation if you make many changes to your source files after execution and before feedback compilation. For further details on the interprocedural and profile-guided optimizations, refer to the Intel C++ Compiler User’s Guide. ® Intel VTune™ Performance Analyzer The Intel VTune Performance Analyzer is a powerful software-profiling tool for Microsoft Windows and Linux.
  • Page 443: Sampling

    Sampling Sampling allows you to profile all active software on your system, including operating system, device driver, and application software. It works by occasionally interrupting the processor and collecting the instruction address, process ID, and thread ID. After the sampling activity completes, the VTune analyzer displays the data by process, thread, software module, function, or line of source.
  • Page 444: Event-Based Sampling

    The VTune analyzer indicates where microarchitectural events, specific to the Pentium 4, Pentium M and Intel Xeon processors, occur most often. On Pentium M processors, the VTune analyzer can collect two...
  • Page 445: Workload Characterization

    different events at a time. The number of events that the VTune analyzer can collect at once on the Pentium 4 and Intel Xeon processors depends on the events selected. Event-based samples are collected after a specific number of processor events have occurred.
  • Page 446 Hardware prefetch mechanisms can be controlled on demand using the model-specific register IA32_MISC_ENABLES. Appendix B of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B describes the specific bit locations of the IA32_MISC_ENABLES MSR.
  • Page 447: Call Graph

    stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a workload is to compare the ratio between bus read transactions and the number of DTLB pagewalks due to read traffic, under the condition of disabling the hardware prefetch while measuring bus traffic of the workload.
  • Page 448: Counter Monitor

    The Intel® Tuning Assistant can generate tuning advice based on counter monitor and sampling data. You can invoke the Intel Tuning Assistant from the source, counter monitor, or sampling views by clicking on the Intel Tuning Assistant icon...
  • Page 449: Benefits Summary

    LAPACK and BLAS, Discrete Fourier Transforms (DFT), vector transcendental functions (vector math library/VML) and vector statistical functions (VSL). Intel MKL is optimized for the latest features and capabilities of the Intel Pentium 4 processor, Pentium M processor, Intel Xeon processors ®...
  • Page 450: Optimizations With The Intel ® Performance Libraries

    MKL and IPP functions are safe for use in a threaded environment. Optimizations with the Intel® Performance Libraries The Intel Performance Libraries implement a number of optimizations that are discussed throughout this manual. Examples include architecture-specific tuning such as loop unrolling, instruction pairing and scheduling;...
  • Page 451: Enhanced Debugger (Edb

    Intel Performance Libraries benefit from new architectural features of future generations of Intel processors simply by relinking the application with upgraded versions of the libraries. Enhanced Debugger (EDB) The Enhanced Debugger (EDB) enables you to debug C++, Fortran or mixed language programs running under Windows NT* or Windows 2000 (not Windows 98).
  • Page 452: Figure A-2 Intel Thread Checker Can Locate Data Race Conditions

    The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes your program and automatically locates threading errors. As your program runs, the Intel Thread Checker monitors memory accesses and other events and automatically detects situations which could cause unpredictable threading-related results.
  • Page 453: Thread Profiler

    Thread Profiler The thread profiler is a plug-in data collector for the Intel VTune Performance Analyzer. Use it to analyze threading performance and identify parallel performance problems. The thread profiler graphically illustrates what each thread is doing at various levels of detail using a hierarchical summary.
  • Page 454: Intel ® Software College

    Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded Execution Timelines. Intel® Software College: The Intel Software College is a valuable resource for classes on Streaming SIMD Extensions 2 (SSE2), threading, and the IA-32 Intel Architecture. For online training on how to use the SSE2 and...
  • Page 455: Appendix B Using Performance Monitoring Events

    The descriptions of the Intel Pentium 4 processor performance metrics use terminology that is specific to the Intel NetBurst microarchitecture and to the implementation in the Pentium 4 and Intel Xeon processors. The following sections explain the terminology specific to Pentium 4...
  • Page 456: Pentium 4 Processor-Specific Terminology

    Branch mispredictions incur a large penalty on microprocessors with deep pipelines. In general, the direction of branches can be predicted with a high degree of accuracy by the front end of the Intel Pentium 4 processor, such that most computations can be performed along the predicted path while waiting for the resolution of the branch.
  • Page 457: Replay

    Replay In order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes aggressively schedules execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, μ...
  • Page 458: Counting Clocks

    miss more than once during its lifetime, but a Misses Retired metric (for example, 1st-Level Cache Load Misses Retired) increments only once for that μop. Counting Clocks: The count of cycles, also known as clockticks, forms a fundamental basis for measuring how long a program takes to execute, and is part of efficiency ratios like cycles per instruction (CPI).
  • Page 459: Non-Halted Clockticks

    The first two metrics use performance counters and thus can be used to cause an interrupt upon overflow for sampling. They may also be useful for those cases where it is easier for a tool to read a performance counter than the time stamp counter. The time stamp counter is accessed via the RDTSC instruction.
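As a small, generic illustration (not part of the VTune tool chain described here), the sketch below brackets a placeholder workload with two RDTSC reads using the __rdtsc() compiler intrinsic; the intrinsic and the x86intrin.h header are assumptions that hold for gcc/clang (MSVC exposes the same intrinsic through intrin.h).

/* Measure elapsed clockticks around a placeholder workload with RDTSC. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static volatile double sink;   /* keeps the workload from being optimized away */

static void workload(void)
{
    double acc = 0.0;
    for (int i = 1; i <= 1000000; i++)
        acc += 1.0 / i;
    sink = acc;
}

int main(void)
{
    uint64_t start = __rdtsc();      /* executes RDTSC */
    workload();
    uint64_t stop = __rdtsc();
    printf("elapsed clockticks: %llu\n", (unsigned long long)(stop - start));
    return 0;
}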
  • Page 460: Non-Sleep Clockticks

    Non-Sleep Clockticks The performance monitoring counters can also be configured to count clocks whenever the performance monitoring hardware is not powered-down. To count “non-sleep clockticks” with a performance-monitoring counter, do the following: • Select any one of the 18 counters. •...
  • Page 461: Time Stamp Counter

    Using Performance Monitoring Events that logical processor is not halted (it may include some portion of the clock cycles for that logical processor to complete a transition into a halted state). A physical processor that supports Hyper-Threading Technology enters into a power-saving state if all logical processors are halted.
  • Page 462: Microarchitecture Notes

    Microarchitecture Notes Trace Cache Events The trace cache is not directly comparable to an instruction cache. The two are organized very differently. For example, a trace can span many lines' worth of instruction-cache data. As with most microarchitectural elements, trace cache performance is only an issue if something else is not a bigger bottleneck.
  • Page 463 Using Performance Monitoring Events There is a simplified block diagram below of the sub-systems connected to the IOQ unit in the front side bus sub-system and the BSQ unit that interface to the IOQ. A two-way SMP configuration is illustrated. 1st-level cache misses and writebacks (also called core references) result in references to the 2nd-level cache.
  • Page 464: Figure B-1 Relationships Between The Cache Hierarchy, Ioq, Bsq And Front Side Bus

    Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus
  • Page 465: Reads Due To Program Loads

    The granularities of core references are listed below, according to the performance monitoring events that are documented in Appendix A of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B. Reads due to program loads •...
  • Page 466: Writebacks (Dirty Evictions

    • IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses Writebacks (dirty evictions) • BSQ_cache_reference: 64 bytes • BSQ_allocation: 64 bytes • BSQ_active_entries: 64 bytes • IOQ_allocation, IOQ_active_entries: 64 bytes The count of IOQ allocations may exceed the count of corresponding BSQ allocations on current implementations for several reasons, including: •...
  • Page 467: Usage Notes For Specific Metrics

    2nd-level cache, and the 3rd-level cache if present. But due to the current implementation of BSQ_cache_reference in Pentium 4 and Intel Xeon processors, they should not be used to calculate cache hit rates or cache miss rates. The following three paragraphs describe some of the issues related to BSQ_cache_reference, so that its results can be better interpreted.
  • Page 468 64-byte granularity. Prefetches themselves are not counted as either hits or misses, as of Pentium 4 and Intel Xeon processors with a CPUID signature of 0xf21. However, Pentium 4 processor implementations with a CPUID signature of 0xf07 and earlier have the problem that reads to lines that are already being prefetched are counted as hits in addition to misses, thus overcounting hits.
  • Page 469: Usage Notes On Bus Activities

    That memory performance change may or may not be reflected in the measured FSB latencies. Also note that for Pentium 4 and Intel Xeon Processor implementations with an integrated 3rd-level cache, BSQ entries are allocated for all 2nd-level writebacks (replaced lines), not just those that become bus...
  • Page 470: Metrics Descriptions And Categories

    BSQ entries due to such references will become bus transactions. Metrics Descriptions and Categories The Performance metrics for Intel Pentium 4 and Intel Xeon processors are listed in Table B-1. These performance metrics consist of recipes to program specific Pentium 4 and Intel Xeon processor performance monitoring events to obtain event counts that represent one of the following: number of instructions, cycles, or occurrences.
  • Page 471 The additional sub-event information is included in column 3 as various tags, which are described in “Performance Metrics and Tagging Mechanisms”. For event names that appear in this column, refer to the IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A & 3B. •...
  • Page 472: Table B-1 Pentium 4 Processor Performance Metrics

    Table B-1 Pentium 4 Processor Performance Metrics. General Metrics: Non-Sleep Clockticks: The number of clockticks while a processor is not in any sleep mode. Non-Halted Clockticks: The number of clockticks during which the processor is neither halted nor in sleep. Instructions Retired: Non-bogus IA-32...
  • Page 473 Table B-1 Pentium 4 Processor Performance Metrics (continued): Speculative Uops Retired: Number of uops retired (includes both instructions executed to completion and those speculatively executed in the path of branch mispredictions). Branching Metrics: Branches Retired: All branch instructions executed to completion. Tagged ...: The events counts...
  • Page 474 Table B-1 Pentium 4 Processor Performance Metrics (continued): Mispredicted returns: The number of mispredicted returns, including all causes. All conditionals: The number of branches that are conditional jumps (may overcount if the branch is from build mode or there is a machine clear near the branch). Mispredicted...
  • Page 475 Table B-1 Pentium 4 Processor Performance Metrics (continued): TC Flushes: Number of TC flushes (the counter will count twice for each occurrence; divide the count by 2 to get the number of flushes). Logical Processor 0 Deliver Mode: The number of cycles that the trace and delivery engine...
  • Page 476 Table B-1 Pentium 4 Processor Performance Metrics (continued): Logical Processor 1 Deliver Mode: The number of cycles that the trace and delivery engine (TDE) is delivering traces associated with logical processor 1, regardless of the operating modes of the TDE for traces associated with logical processor 0.
  • Page 477 Table B-1 Pentium 4 Processor Performance Metrics (continued): Logical Processor 0 Build Mode: The number of cycles that the trace and delivery engine (TDE) is building traces associated with logical processor 0, regardless of the operating modes of the TDE for traces associated with logical processor 1.
  • Page 478 Table B-1 Pentium 4 Processor Performance Metrics (continued): Trace Cache Misses: The number of times that significant delays occurred in order to decode instructions and build a trace because of a TC miss. TC to ROM Transfers: Twice the number of times that the ROM microcode is...
  • Page 479 Table B-1 Pentium 4 Processor Performance Metrics (continued): Memory Metrics: Page Walk DTLB All Misses: The number of page walk requests due to DTLB misses from either load or store. 1st-Level Cache Load Misses Retired: The number of retired μops that experienced...
  • Page 480 Table B-1 Pentium 4 Processor Performance Metrics (continued): 64K Aliasing Conflicts: The number of 64K aliasing conflicts. A memory reference causing a 64K aliasing conflict can be counted more than once in this statistic. The performance penalty resulting from a 64K-aliasing conflict can vary from unnoticeable to...
  • Page 481 Table B-1 Pentium 4 Processor Performance Metrics (continued): MOB Load Replays: The number of replayed loads related to the Memory Order Buffer (MOB). This metric counts only the case where the store-forwarding data is not an aligned subset of the stored data.
  • Page 482 Table B-1 Pentium 4 Processor Performance Metrics (continued): 2nd-Level Cache Reads Hit Shared: The number of 2nd-level cache read references (loads and RFOs) that hit the cache line in shared state. Beware of granularity differences. 2nd-Level Cache Reads Hit Modified: The number of 2nd-level cache read...
  • Page 483 Table B-1 Pentium 4 Processor Performance Metrics (continued): 3rd-Level Cache Reads Hit Modified: The number of 3rd-level cache read references (loads and RFOs) that hit the cache line in modified state. Beware of granularity differences. 3rd-Level Cache Reads Hit Exclusive: The number of 3rd-level cache read...
  • Page 484 Table B-1 Pentium 4 Processor Performance Metrics (continued): All WCB Evictions: The number of times a WC buffer eviction occurred due to any cause (this can be used to distinguish 64K aliasing cases that contribute more significantly to the performance penalty, e.g., stores that are 64K aliased.
  • Page 485 Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  • Page 486 Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2. (Bus Accesses –...
  • Page 487 RFOs). Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2. Reads The number of all...
  • Page 488 Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2. All UC from the...
  • Page 489 “Bus Accesses from the processor” to get bus request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  • Page 490 Non-prefetch read request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  • Page 491 Divide by “All UC from the processor” to get UC request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  • Page 492 “Writes from the Processor” to get bus write request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2. Bus Accesses...
  • Page 493 Table B-1 Pentium 4 Processor Performance Metrics (continued): Write WC Full (BSQ): The number of write (but neither writeback nor RFO) transactions to WC-type memory. Write WC Partial (BSQ): The number of partial write transactions to WC-type memory. User note: This event may undercount WC partials that originate...
  • Page 494 Table B-1 Pentium 4 Processor Performance Metrics (continued): Reads Non-prefetch Full (BSQ): The number of read (excludes RFOs and HW|SW prefetches) transactions to WB-type memory. Beware of granularity issues with this event. Reads Invalidate Full-RFO (BSQ): The number of read invalidate (RFO) transactions to...
  • Page 495 Table B-1 Pentium 4 Processor Performance Metrics (continued): UC Write Partial (BSQ): The number of UC write transactions. Beware of granularity issues between BSQ and FSB IOQ events. IO Reads Chunk (BSQ): The number of 8-byte aligned IO port read transactions.
  • Page 496 Table B-1 Pentium 4 Processor Performance Metrics (continued): WB Writes Full Underway (BSQ): This is an accrued sum of the durations of writeback (evicted from cache) transactions to WB-type memory. Divide by Writes WB Full (BSQ) to estimate average request latency.
  • Page 497 Table B-1 Pentium 4 Processor Performance Metrics (continued): Write WC Partial Underway (BSQ): This is an accrued sum of the durations of partial write transactions to WC-type memory. Divide by Write WC Partial (BSQ) to estimate average request latency. User note: Allocated entries of WC partials that originate...
  • Page 498 Table B-1 Pentium 4 Processor Performance Metrics (continued): SSE Input Assists: The number of occurrences of SSE/SSE2 floating-point operations needing assistance to handle an exception condition. The number of occurrences includes speculative counts. Packed SP Retired: Non-bogus packed single-precision instructions retired.
  • Page 499 Table B-1 Pentium 4 Processor Performance Metrics (continued): Stalled Cycles of Store Buffer Resources (non-standard): The duration of stalls due to lack of store buffers. Stalls of Store Buffer Resources (non-standard): The number of allocation stalls due to lack of store buffers.
  • Page 500: Performance Metrics And Tagging Mechanisms

    Compare Edge split_load_retired to count at retirement. This section replay_event front_end_event. Please refer to Appendix A of the IA-32 Intel® μops at retirement using the μops so they can be detected at retirement. Some μops. The event names referenced in μ...
  • Page 501: Table B-2 Metrics That Utilize Replay Tagging Mechanism

    Table B-2 Metrics That Utilize Replay Tagging Mechanism (bit fields to set in IA32_PEBS_ENABLE): 1stL_cache_load_miss_retired: Bit 0, BIT 24, BIT 25. 2ndL_cache_load_miss_retired: Bit 1, BIT 24, BIT 25. DTLB_load_miss_retired: Bit 2, BIT 24, BIT 25. DTLB_store_miss_retired: Bit 2, BIT 24,...
  • Page 502: Tags For Front_End_Event

    Tags for front_end_event: Table B-3 provides a list of the tags that are used by various metrics derived from the front_end_event. The event names referenced in column 2 can be found from the Pentium 4 processor performance monitoring events. Table B-3 Metrics That Utilize the Front-end Tagging Mechanism: Front-end Metric Tags: Memory_loads, Memory_stores...
  • Page 503: Table B-4 Metrics That Utilize The Execution Tagging Mechanism

    Table B-4 Metrics That Utilize the Execution Tagging Mechanism: Execution Metric Tags: Packed_SP_retired, Scalar_SP_retired, Scalar_DP_retired, 128_bit_MMX_retired, 64_bit_MMX_retired, X87_FP_retired. For Packed_SP_retired, the upstream ESCR entry specifies: set the ALL bit in the event mask and the TagUop bit in the ESCR of packed_SP_uop.
  • Page 504: Using Performance Metrics With Hyper-Threading Technology

    Using Performance Metrics with Hyper-Threading Technology: On Intel Xeon processors that support Hyper-Threading Technology, the performance metrics listed in Table B-1 may be qualified to associate the counts with a specific logical processor, provided the relevant performance monitoring events support qualification by logical processor.
  • Page 505: Table B-6 Metrics That Support Qualification By Logical Processor And Parallel Counting

    The performance metrics listed in Table B-1 fall into three categories:
    • Logical processor specific and supporting parallel counting.
    • Logical processor specific but constrained by ESCR limitations.
    • Logical processor independent and not supporting parallel counting.
    Table B-5 lists performance metrics in the first and second category. Table B-6 lists performance metrics in the third category.
  • Page 506 Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued): Branching Metrics: Branches Retired, Tagged Mispredicted Branches Retired, Mispredicted Branches Retired, All returns, All indirect branches, All calls, All conditionals, Mispredicted returns, Mispredicted indirect branches, Mispredicted calls, Mispredicted conditionals. TC and Front End Metrics:...
  • Page 507 Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued): Memory Metrics: Split Load Replays, Split Store Replays, MOB Load Replays, 64k Aliasing Conflicts, 1st-Level Cache Load Misses Retired, 2nd-Level Cache Load Misses Retired, DTLB Load Misses Retired, Split Loads Retired, Split Stores Retired...
  • Page 508 Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued): Bus Metrics: Bus Accesses from the Processor, Non-prefetch Bus Accesses from the Processor, Reads from the Processor, Writes from the Processor, Reads Non-prefetch from the Processor, All WC from the Processor, All UC from the Processor, Bus Accesses from All Agents...
  • Page 509: Table B-7 Metrics That Are Independent Of Logical Processors

    Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued) Characterization Metrics Parallel counting is not supported due to ESCR restrictions. Table B-7 Metrics That Are Independent of Logical Processors General Metrics TC and Front End Metrics Memory Metrics Bus Metrics Characterization Metrics...
  • Page 510: Using Performance Events Of Intel Core Solo And Intel Core Duo Processors

    Using Performance Events of Intel Core Solo and Intel Core Duo Processors: There are performance events specific to the microarchitecture of Intel Core Solo and Intel Core Duo processors (see Table A-9 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B). Understanding the Results in a Performance Counter: Each performance event detects a well-defined microarchitectural condition occurring in the core while the core is active.
  • Page 511: Ratio Interpretation

    There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles. All three events are detected for the unit selected by event 3CH. Some events detect microarchitectural conditions but are limited in their ability to identify the originating core or physical processor.
  • Page 512: Notes On Selected Events

    Notes on Selected Events: This section provides event-specific notes for interpreting performance events listed in Table A-9 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B. • L2_Reject_Cycles, event number 30H: This event counts the cycles during which the L2 cache rejected new access requests.
  • Page 513 • Serial_Execution_Cycles, event number 3CH, unit mask 02H: This event counts the bus cycles during which the core is actively executing code (non-halted) while the other core in the physical processor is halted. • L1_Pref_Req, event number 4FH, unit mask 00H: This event counts the number of times the Data Cache Unit (DCU) requests to prefetch a data cache line from the L2 cache.
  • Page 514
  • Page 515: Appendix C Ia-32 Instruction Latency And Throughput

    IA-32 instructions. The instruction timing data varies within the IA-32 family of processors. Only data specific to the Intel Pentium 4, Intel Xeon, and Intel Pentium M processors are provided. The relevance of instruction throughput and latency information for code tuning is discussed in Chapter 1 and Chapter 2; see “Execution Core Detail”...
  • Page 516: Overview

    Overview: The current generation of the IA-32 processor family uses out-of-order execution with dynamic scheduling and buffering to tolerate poor instruction selection and scheduling that may occur in legacy code. The processor can reorder μops to cover latency delays and to avoid resource conflicts. In some cases, the microarchitecture’s ability to avoid such delays can be enhanced by arranging IA-32 instructions.
  • Page 517 ROM. These instructions with longer μop flows incur a delay in the front end and reduce the supply of uops to the execution core. In Pentium 4 and Intel Xeon processors, transfers to microcode ROM often reduce how efficiently μops can be packed into the trace cache.
  • Page 518: Definitions

    FP_ADD FP_MUL cluster (see Figure 1-4, Figure 1-4 applies FP_EXECUTE to Pentium 4 and Intel Xeon processors with CPUID signature of family 15, model encoding = 0, 1, 2). , or in the MMX_SHFT...
  • Page 519 All numeric data in the tables are: — approximate and are subject to change in future implementations of the Intel NetBurst microarchitecture or the Pentium M processor microarchitecture. — not meant to be used as reference numbers for comparisons of instruction-level performance benchmarks.
  • Page 520: Latency And Throughput With Register Operands

    Latency and Throughput with Register Operands: IA-32 instruction latency and throughput data are presented in Table C-2 through Table C-8. The tables include the Streaming SIMD Extension 3, Streaming SIMD Extension 2, Streaming SIMD Extension, MMX technology, and most of the commonly used IA-32 instructions. Instruction latency and throughput of the Pentium 4 processor and of the Pentium M processor are given in separate columns.
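To motivate why the tables report both latency and throughput, here is a generic, illustrative C sketch that is not derived from any particular table entry: the first loop forms one long dependency chain and is limited by the add latency, while the second keeps four independent accumulators so the out-of-order core can overlap the operations and approach the throughput limit instead.

#include <stddef.h>

/* Latency-bound: every addition depends on the previous one. */
double sum_serial(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Throughput-bound: four independent chains, combined once at the end. */
double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}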
  • Page 521: Table C-2 Streaming Simd Extension 2 128-Bit Integer Instructions

    Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions Instruction CPUID CVTDQ2PS xmm, xmm CVTPS2DQ xmm, xmm CVTTPS2DQ xmm, xmm MOVD xmm, r32 MOVD r32, xmm MOVDQA xmm, xmm MOVDQU xmm, xmm MOVDQ2Q mm, xmm MOVQ2DQ xmm, mm MOVQ xmm, xmm PACKSSWB/PACKSSDW/ PACKUSWB xmm, xmm PADDB/PADDW/PADDD...
  • Page 522 Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions (continued) Instruction PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm PEXTRW r32, xmm, imm8 PINSRW xmm, r32, imm8 PMADDWD xmm, xmm PMAX xmm, xmm PMIN xmm, xmm PMOVMSKB r32, xmm PMULHUW/PMULHW/PMULLW xmm, xmm PMULUDQ mm, mm PMULUDQ xmm, xmm POR xmm, xmm PSADBW xmm, xmm...
  • Page 523: Table C-3 Streaming Simd Extension 2 Double-Precision Floating-Point Instructions

    Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions (continued) Instruction PSUBB/PSUBW/PSUBD xmm, xmm PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ xmm, PUNPCKHQDQ xmm, xmm PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ xmm, xmm PUNPCKLQDQ xmm, PXOR xmm, xmm See “Table Footnotes” Table C-3 Streaming SIMD Extension 2 Double-precision Floating-point Instructions Instruction CPUID...
  • Page 524 Table C-3 Streaming SIMD Extension 2 Double-precision Floating-point Instructions (continued) Instruction COMISD xmm, xmm CVTDQ2PD xmm, xmm CVTPD2PI mm, xmm CVTPD2DQ xmm, xmm CVTPD2PS xmm, xmm CVTPI2PD xmm, mm CVTPS2PD xmm, xmm CVTSD2SI r32, xmm CVTSD2SS xmm, xmm CVTSI2SD xmm, r32 CVTSS2SD xmm, xmm CVTTPD2PI mm, xmm...
  • Page 525 Table C-3 Streaming SIMD Extension 2 Double-precision Floating-point Instructions (continued) Instruction DIVPD xmm, xmm DIVSD xmm, xmm MAXPD xmm, xmm MAXSD xmm, xmm MINPD xmm, xmm MINSD xmm, xmm MOVAPD xmm, xmm MOVMSKPD r32, xmm MOVSD xmm, xmm MOVUPD xmm, xmm MULPD xmm, xmm MULSD xmm, xmm ORPD...
  • Page 526: Table C-4 Streaming Simd Extension Single-Precision Floating-Point Instructions

    Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions Instruction CPUID 0F3n ADDPS xmm, xmm ADDSS xmm, xmm ANDNPS xmm, xmm ANDPS xmm, xmm CMPPS xmm, xmm CMPSS xmm, xmm COMISS xmm, xmm CVTPI2PS xmm, mm CVTPS2PI mm, xmm CVTSI2SS xmm, r32 CVTSS2SI r32, xmm CVTTPS2PI mm, xmm CVTTSS2SI r32, xmm...
  • Page 527 Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions (continued) Instruction MOVLHPS xmm, xmm MOVMSKPS r32, xmm MOVSS xmm, xmm MOVUPS xmm, xmm MULPS xmm, xmm MULSS xmm, xmm ORPS xmm, xmm RCPPS xmm, xmm RCPSS xmm, xmm RSQRTPS xmm, xmm RSQRTSS xmm, xmm SHUFPS...
  • Page 528: Table C-6 Mmx Technology 64-Bit Instructions

    Table C-5 Streaming SIMD Extension 64-bit Integer Instructions Instruction CPUID PAVGB/PAVGW mm, mm PEXTRW r32, mm, imm8 PINSRW mm, r32, imm8 PMAX mm, mm PMIN mm, mm PMOVMSKB r32, mm PMULHUW mm, mm PSADBW mm, mm PSHUFW mm, mm, imm8 See “Table Footnotes”...
  • Page 529 Table C-6 MMX Technology 64-bit Instructions (continued) Instruction PCMPGTB/PCMPGTD/PCMPGTW mm, mm PMADDWD mm, mm PMULHW/PMULLW mm, mm POR mm, mm PSLLQ/PSLLW/PSLLD mm, mm/imm8 PSRAW/PSRAD mm, mm/imm8 PSRLQ/PSRLW/PSRLD mm, mm/imm8 PSUBB/PSUBW/PSUBD mm, mm PSUBSB/PSUBSW/PSUBUSB/PSUBUSW mm, PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ mm, mm PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ mm, PXOR mm, mm...
  • Page 530 Table C-7 IA-32 x87 Floating-point Instructions Instruction CPUID FABS FADD FSUB FMUL FCOM FCHS FDIV Single Precision FDIV Double Precision FDIV Extended Precision FSQRT SP FSQRT DP FSQRT EP F2XM1 FCOS FPATAN FPTAN FSIN FSINCOS FYL2X FYL2XP1 C-16 Latency 0F3n 0F2n 0x69n 0F3n...
  • Page 531 Table C-7 IA-32 x87 Floating-point Instructions (continued) Instruction FSCALE FRNDINT FXCH FLDZ FINCSTP/FDECSTP See “Table Footnotes” Table C-8 IA-32 General Purpose Instructions Instruction CPUID 0F3n ADC/SBB reg, reg ADC/SBB reg, imm ADD/SUB AND/OR/XOR BSF/BSR BSWAP BTC/BTR/BTS CMP/TEST DEC/INC IMUL r32 IMUL imm32 IMUL IDIV...
  • Page 532 Table C-8 IA-32 General Purpose Instructions (continued) Instruction LOOP MOVSB/MOVSW MOVZB/MOVZW NEG/NOT/NOP POP r32 PUSH RCL/RCR reg, 1 ROL/ROR SAHF SAL/SAR/SHL/SHR SCAS SETcc STOSB XCHG CALL 66-80 See “Table Footnotes” C-18 Latency Appli- cable 14-18 56-70 Throughput Execution Unit MEM_LOAD, MEM_STORE, MEM_LOAD, ALU,MEM_...
  • Page 533: Table Footnotes

    The names of execution units apply to processor implementations of the Intel NetBurst microarchitecture only with CPUID signature of family 15, model encoding = 0, 1, 2. They include: FP_EXECUTE FPMOVE execution units and ports in the out-of-order core.
  • Page 534: Latency And Throughput With Memory Operands

    Pentium 4 and Intel Xeon processors. Latency and Throughput with Memory Operands: The discussion in this section applies to the Intel Pentium 4 and Intel Xeon processors. Typically, instructions with a memory address as the source operand add one more μop to the “reg, reg” instruction types listed in Table C-1 through C-7.
  • Page 535 For the sake of simplicity, all data being requested is assumed to reside in the first-level data cache (a cache hit). In general, IA-32 instructions with load operations that execute in the integer ALU units require two more clock cycles than the corresponding register-to-register flavor of the same instruction.
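A hypothetical C illustration of that cost model: when a loop-invariant value is re-read through a pointer on every iteration, the extra load sits on each operation's dependency chain; hoisting it into a local lets the operation use a register operand instead. A compiler may perform this hoisting itself when its aliasing rules allow, so the example only serves to make the extra clocks concrete.

#include <stddef.h>

/* Memory-operand form: because dst may alias scale, the compiler generally
 * must reload *scale on every iteration, adding a load to each multiply. */
void scale_memory_operand(float *dst, const float *src, const float *scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * (*scale);
}

/* Register-operand form: the value is hoisted once and stays in a register. */
void scale_register_operand(float *dst, const float *src, const float *scale, size_t n)
{
    float k = *scale;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}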
  • Page 536
  • Page 537: Appendix D Stack Alignment

    __m128 register spill locations aligned throughout a function invocation. The Intel C++ Compiler for Win32* Systems supports the conventions presented here, which help to prevent memory references from incurring penalties due to misaligned data by keeping them aligned to 16-byte boundaries. In addition, this scheme supports improved...
  • Page 538 Microsoft-compiled function, for example, can only assume that the frame pointer it used is 4-byte aligned. Earlier versions of the Intel C++ Compiler for Win32 Systems have attempted to provide 8-byte aligned stack frames by dynamically adjusting the stack frame pointer in the prologue of 8-byte alignment of the functions it compiles.
  • Page 539: Figure D-1 Stack Frames Based On Alignment Type

    Figure D-1 Stack Frames Based on Alignment Type (ESP-based aligned frame: parameters, return address, padding, register save area, local variables and spill slots, __cdecl parameter passing space, __stdcall parameter passing space). As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the caller.
  • Page 540: Aligned Esp-Based Stack Frames

    Example D-1 in the following sections illustrates this technique. Note the entry points foo and foo.aligned; the latter is the alternate aligned entry point. Aligned esp-Based Stack Frames: This section discusses data and parameter alignment and the declspec(align) extended attribute, which can be used to request alignment in C and C++ code.
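A concrete, hypothetical usage sketch of the attribute discussed above. The __declspec(align(16)) spelling is the one used by the Intel and Microsoft compilers on Win32; __attribute__((aligned(16))) is assumed here as the equivalent for gcc-style compilers. Sixteen-byte alignment allows the data to be accessed with aligned 128-bit SIMD loads and stores.

#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* A 16-byte aligned file-scope buffer. */
static ALIGN16 float coefficients[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

/* Alignment can also be requested for a local (stack) object. */
float sum4(void)
{
    ALIGN16 float tmp[4];
    float s = 0.0f;
    for (int i = 0; i < 4; i++) {
        tmp[i] = coefficients[i];
        s += tmp[i];
    }
    return s;
}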
  • Page 541: Example D-1 Aligned Esp-Based Stack Frames

    Example D-1 Aligned esp-Based Stack Frames void _cdecl foo (int k) int j; foo: push foo.aligned: push common: push j = k; foo(5); call return j; // See Note A ebx, esp esp, 0x00000008 esp, 0xfffffff0 esp, 0x00000008 common ebx, esp // See Note B esp, 20 edx, [ebx + 8]...
  • Page 542: Aligned Ebp-Based Stack Frames

    NOTE. block beginnings are aligned. This places the stack pointer at a 12 mod 16 boundary, as the return pointer has been pushed. Thus, the unaligned entry point must force the stack pointer to this boundary. stack is at an 8 mod 16 boundary, and adds sufficient space to the stack so that the stack pointer is aligned to a 0 mod 16 boundary.
  • Page 543: Example D-2 Aligned Ebp-Based Stack Frames

    Example D-2 Aligned ebp-based Stack Frames void _stdcall foo (int k) int j; foo: push ebx, esp esp, 0x00000008 esp, 0xfffffff0 esp, 0x00000008 after add common foo.aligned: push after push ebx, esp common: push used for push after push ebp, [ebx + 4] and store [esp + 4], ebp ebp, esp...
  • Page 544 Example D-2 Aligned ebp-based Stack Frames (continued) esp and ebp j = k; edx, [ebx + 8] caller aligned [ebp - 16], edx foo(5); esp, -4 [esp],5 call foo.aligned(5); esp,-16 should [esp],5 call foo.aligned esp,12 return j; eax,[ebp-16] esp,ebp esp,ebx ret 4 Stack Alignment // the goal is to make...
  • Page 545: Stack Frame Optimizations

    16, and thus the caller must account for the remaining adjustment. Stack Frame Optimizations The Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up and used. These optimizations are as follows: •...
  • Page 546: Inlined Assembly And Ebx

    (since the function’s epilog). For additional information on this and other related issues, see the relevant application notes in the Intel Architecture Performance Training Center. CAUTION: the ebx register generally should not be...
  • Page 547: Appendix E Mathematics Of Prefetch Scheduling Distance

    Mathematics of Prefetch Scheduling Distance This appendix discusses how far away to insert prefetch instructions. It presents a mathematical model allowing you to deduce a simplified equation which you can use for determining the prefetch scheduling distance (PSD) for your application. For your convenience, the first section presents this simplified equation;...
  • Page 548: Mathematical Model For Psd

    Consider the following example of a heuristic equation, assuming that the parameters have the values as indicated, where 60 corresponds to... The values of the parameters in the equation can be derived from the documentation for memory components and chipsets as well as from vendor datasheets.
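Because only the shape of the simplified equation is summarized here, the helper below is a hypothetical sketch of how such a heuristic could be evaluated; the formula form (memory clocks divided by compute clocks, rounded up) and every parameter name and example value other than the 60-clock figure are assumptions to be checked against the full appendix.

#include <math.h>
#include <stdio.h>

/* Hypothetical helper:
 * psd = ceil((lookup + xfer * (lines_prefetched + lines_stored))
 *            / (cpi * instructions_per_iteration))
 * The 25-clock transfer time, CPI of 1.5 and instruction count are placeholders. */
static int prefetch_scheduling_distance(double lookup_clocks,
                                        double xfer_clocks_per_line,
                                        int lines_prefetched,
                                        int lines_stored,
                                        double cpi,
                                        int insts_per_iteration)
{
    double memory_clocks  = lookup_clocks +
                            xfer_clocks_per_line * (lines_prefetched + lines_stored);
    double compute_clocks = cpi * insts_per_iteration;
    return (int)ceil(memory_clocks / compute_clocks);
}

int main(void)
{
    int psd = prefetch_scheduling_distance(60.0, 25.0, 1, 1, 1.5, 10);
    printf("psd = %d\n", psd);   /* with these placeholder values: ceil(110 / 15) = 8 */
    return 0;
}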
  • Page 549: Example E-1 Calculating Insertion For Scheduling Distance Of 3

    Note that the potential effects of µop reordering are not factored into the estimations discussed. Examine Example E-1, which uses a prefetch scheduling distance of 3, that is, psd = 3. The data prefetched in iteration i will actually be used in iteration i+3. Tc (computation latency) represents the cycles needed to execute the loop, while il (iteration latency) represents the cycles needed to execute this loop with the actual run-time memory footprint.
  • Page 550: Figure E-1 Pentium Ii, Pentium Iii And Pentium 4 Processors Memory Pipeline Sketch

    Memory access plays a pivotal role in prefetch scheduling. For a better understanding of the memory subsystem, consider the Streaming SIMD Extensions and Streaming SIMD Extensions 2 memory pipeline depicted in Figure E-1 (Pentium II, Pentium III and Pentium 4 Processors Memory Pipeline Sketch). Assume that three cache lines are accessed per iteration and four chunks of data are returned per iteration for each cache line.
  • Page 551 varies dynamically and is also system hardware-dependent. The static variants include the core-to-front-side-bus ratio, memory manufacturer and memory controller (chipset). The dynamic variants include the memory page open/miss occasions, memory accesses sequence, different memory types, and so on. To determine the proper prefetch scheduling distance, follow these steps and formulae: •...
  • Page 552: No Preloading Or Prefetch

    No Preloading or Prefetch: The traditional programming approach does not perform data preloading or prefetch. It is sequential in nature and will experience stalls because the memory is unable to provide the data immediately when the execution pipeline requires it. Examine Figure E-2 (Execution Pipeline, No Preloading or Prefetch).
  • Page 553: Figure E-3 Compute Bound Execution Pipeline

    The iteration latency is approximately equal to the computation latency plus the memory leadoff latency (which includes cache miss latency, chipset latency, bus arbitration, and so on) plus the data transfer latency, where transfer latency = number of lines per iteration * line burst latency. This means that the decoupled memory and execution are ineffective at exploiting parallelism because of the flow dependency.
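Written symbolically (the symbol names are chosen here for readability and are not taken from the figure itself), the relationship stated above is:

il \approx T_c + T_l + N_{lines} \cdot T_b

where T_c is the computation latency per iteration, T_l the memory leadoff latency, T_b the line burst (transfer) latency, and N_{lines} the number of cache lines accessed per iteration.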
  • Page 554: Compute Bound (Case:tc >= T L + T B

    The following formula shows the relationship among the parameters. It can be seen from this relationship that the iteration latency is equal to the computation latency, which means the memory accesses are executed in the background and their latencies are completely hidden. Compute Bound (Case: Tl + Tb > Tc > Tb): Now consider the next case by first examining Figure E-4.
  • Page 555 For this particular example, the prefetch scheduling distance is greater than 1. Data being prefetched for iteration i will be consumed in iteration i+2. Figure E-4 represents the case when the leadoff latency plus the data transfer latency is greater than the compute latency, which in turn is greater than the data transfer latency.
  • Page 556: Memory Throughput Bound (Case: Tb >= Tc

    Memory Throughput Bound (Case: Tb >= Tc): When the application or loop is memory throughput bound, there is no way to hide the memory latency. Under such circumstances, the burst latency is always greater than the compute latency. Examine Figure E-5 (Memory Throughput Bound Pipeline). The following relationship calculates the prefetch scheduling distance...
  • Page 557: Example

    ...you cannot do much about it. Typically, a data copy from one space to another space, for example, a graphics driver moving data from writeback memory to write-combining memory, belongs to this category, where the performance advantage from prefetch instructions will be marginal.
  • Page 558: Figure E-6 Accesses Per Iteration, Example 1

    Now, for the case T..., examine the following graph. Consider the graph of accesses per iteration in Example 1, Figure E-6. The prefetch scheduling distance is a step function of Tc, the computation latency. The steady-state iteration latency (il) is either memory-bound or compute-bound, depending on Tc, if prefetches can be scheduled effectively.
  • Page 559: Figure E-7 Accesses Per Iteration, Example 2

    Figure E-7 Accesses per Iteration, Example 2 (psd for different numbers of cache lines prefetched per iteration). In reality, the front-side bus (FSB) pipelining depth is limited; that is, only four transactions are allowed at a time in the Pentium III and Pentium 4 processors.
  • Page 560
  • Page 561 Index 64-bit mode default operand size, 8-1 introduction, 8-1 legacy instructions, 8-1 multiplication notes, 8-2 register usage, 8-2, 8-4 sign-extension, 8-3 software prefetch, 8-6 using CVTSI2SS & CVTSI2SD, 8-6 absolute difference of signed numbers, 4-24 absolute difference of unsigned numbers, 4-23 absolute value, 4-25 accesses per iteration, E-12, E-13 active power, 9-1...
  • Page 562 coding methodologies, 3-13 coding techniques, 3-12 absolute difference of signed numbers, 4-24 absolute difference of unsigned numbers, 4-23 absolute value, 4-25 clipping to an arbitrary signed range, 4-26 clipping to an arbitrary unsigned range, 4-28 generating constants, 4-21 interleaved pack with saturation, 4-8 interleaved pack without saturation, 4-10 non-interleaved unpack, 4-11 signed unpack, 4-7...
  • Page 563 2-47, 4-40 instruction selection, 2-73 integer and floating-point multiply, 2-75, 2-76 integer divide, 2-76 integer-intensive application, 4-1 Intel Core Duo processor, 1-31 Intel Core Solo processor, 1-31 Intel Debugger, A-1 Intel Pentium D processor, 1-39 Intel Performance Library Suite, A-2...
  • Page 564 large load stalls, 2-37 latency, 2-72, 6-5 lea instruction, 2-74 loading and storing to and from the same DRAM page, 4-39 loop blocking, 3-34 loop unrolling, 2-26 loop unrolling option, A-5, A-6 memory bank conflicts, 6-3 memory O=optimization U=using P=prefetch, 6-18 memory operands, 2-71 memory optimization, 4-34...
  • Page 565 optimizing cache utilization cache management, 6-44 examples, 6-15 non-temporal store instructions, 6-10 prefetch and load, 6-9 prefetch Instructions, 6-8 prefetching, 6-7 SFENCE instruction, 6-15, 6-16 streaming, non-temporal stores, 6-10 optimizing floating-point applications copying, shuffling, 5-17 data arrangement, 5-4 data deswizzling, 5-14 data swizzling using intrinsics, 5-12 horizontal ADD, 5-18 planning considerations, 5-2...
  • Page 566 reciprocal instructions, 5-2 rounding control option, A-6 sampling event-based, A-10 Self-modifying code, 2-47 SFENCE Instruction, 6-15, 6-16 signed unpack, 4-7 SIMD integer code, 4-2 SIMD-floating-point code, 5-1 simplified 3D geometry pipeline, 6-22 simplified clipping to an arbitrary signed range, 4-28 single-pass versus multi-pass execution, 6-41 smart cache, 1-31 SoA format, 3-29...
  • Page 567: Intel worldwide office address listings.
  • Page 568: Intel worldwide office address listings (continued).
