IA-32 Intel® Architecture Optimization Reference Manual
Order Number: 248966-013US
April 2006
Intel may make changes to specifications and product descriptions at any time, without notice. This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation.
Contents
Extended Memory 64 Technology (Intel® EM64T) ...
Intel NetBurst Microarchitecture ... 1-8
Design Goals of Intel NetBurst Microarchitecture ... 1-8
Overview of the Intel NetBurst Microarchitecture Pipeline ... 1-9
The Front End ... 1-11
The Out-of-order Core ... 1-12
Retirement ... 1-12
Front End Pipeline Detail ... 1-13
Prefetching ...
Execution Core ... 1-39
Retirement ... 1-39
Multi-Core Processors ... 1-39
Microarchitecture Pipeline and Multi-Core Processors ... 1-42
Shared Cache in Intel Core Duo Processors ... 1-42
Load and Store Operations ... 1-42
Chapter 2 General Optimization Guidelines
Tuning to Achieve Optimum Performance ... 2-1
Tuning to Prevent Known Coding Pitfalls ...
Floating-point Exceptions ... 2-60
Floating-point Modes ... 2-62
Improving Parallelism and the Use of FXCH ... 2-68
x87 vs. Scalar SIMD Floating-point Trade-offs ... 2-69
Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors ... 2-70
Memory Operands ... 2-71
Floating-Point Stalls ... 2-72
x87 Floating-point Operations with Integer Operands ... 2-72
x87 Floating-point Comparison Instructions ... 2-72
Transcendental Functions ... 2-72
Instruction Selection ... 2-73
Complex Instructions ... 2-74
Use of the lea Instruction ... 2-74
Use of the inc and dec Instructions ... 2-75
Use of the shift and rotate Instructions ...
Considerations for Code Conversion to SIMD Programming ... 3-8
Identifying Hot Spots ... 3-10
Determine If Code Benefits by Conversion to SIMD Execution ... 3-11
Coding Techniques ... 3-12
Coding Methodologies ... 3-13
Assembly ... 3-15
Intrinsics ... 3-15
Classes ... 3-17
Automatic Vectorization ... 3-18
Stack and Data Alignment ...
Packed Shuffle Word for 64-bit Registers ... 4-18
Packed Shuffle Word for 128-bit Registers ... 4-19
Unpacking/interleaving 64-bit Data in 128-bit Registers ... 4-20
Data Movement ... 4-21
Conversion Instructions ... 4-21
Generating Constants ... 4-21
Building Blocks ... 4-23
Absolute Difference of Unsigned Numbers ... 4-23
Absolute Difference of Signed Numbers ...
Data Alignment ... 5-4
Data Arrangement ... 5-4
Vertical versus Horizontal Computation ... 5-5
Data Swizzling ... 5-9
Data Deswizzling ... 5-14
Using MMX Technology Code for Copy or Shuffling Functions ... 5-17
Horizontal ADD Using SSE ... 5-18
Use of cvttps2pi/cvttss2si Instructions ... 5-21
Flush-to-Zero and Denormals-are-Zero Modes ...
Hardware Prefetch ... 6-19
Example of Effective Latency Reduction with H/W Prefetch ... 6-20
Example of Latency Hiding with S/W Prefetch Instruction ... 6-22
Software Prefetching Usage Checklist ... 6-24
Software Prefetch Scheduling Distance ... 6-25
Software Prefetch Concatenation ... 6-26
Minimize Number of Software Prefetches ...
Key Practices of System Bus Optimization ... 7-17
Key Practices of Memory Optimization ... 7-17
Key Practices of Front-end Optimization ... 7-18
Key Practices of Execution Resource Optimization ... 7-18
Generality and Performance Impact ... 7-19
Thread Synchronization ... 7-19
Choice of Synchronization Primitives ... 7-20
Synchronization for Short Periods ...
Guidelines for Extending Battery Life ... 9-7
Adjust Performance to Meet Quality of Features ... 9-8
Reducing Amount of Work ... 9-9
Platform-Level Optimizations ... 9-10
Handling Sleep State Transitions ... 9-11
Using Enhanced Intel SpeedStep® Technology ...
Enabling Intel Enhanced Deeper Sleep ... 9-14
Multi-Core Considerations ... 9-15
Enhanced Intel SpeedStep Thread Migration Considerations ...
Using Performance Metrics with Hyper-Threading Technology ... B-50
Using Performance Events of Intel Core Solo and Intel Core Duo processors ... B-56
Understanding the Results in a Performance Counter ... B-56
Ratio Interpretation ... B-57
Notes on Selected Events ... B-58
Appendix C IA-32 Instruction Latency and Throughput
Overview ...
Examples
Example 2-1 Assembly Code with an Unpredictable Branch ... 2-17
Example 2-2 Code Optimization to Eliminate Branches ... 2-17
Example 2-3 Eliminating Branch with CMOV Instruction ... 2-18
Example 2-4 Use of pause Instruction ... 2-19
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm ... 2-20
Example 2-6 Static Taken Prediction Example ...
Example 3-4 Identification of SSE2 with cpuid ... 3-5
Example 3-5 Identification of SSE2 by the OS ... 3-6
Example 3-6 Identification of SSE3 with cpuid ... 3-7
Example 3-7 Identification of SSE3 by the OS ... 3-8
Example 3-8 Simple Four-Iteration Loop ...
Example 4-20 Clipping to an Arbitrary Signed Range [high, low] ... 4-27
Example 4-21 Simplified Clipping to an Arbitrary Signed Range ... 4-28
Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low] ... 4-29
Example 4-23 Complex Multiply by a Constant ... 4-32
Example 4-24 A Large Load after a Series of Small Stores (Penalty) ...
Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation ... 6-50
Example 7-1 Serial Execution of Producer and Consumer Work Items ... 7-9
Example 7-2 Basic Structure of Implementing Producer Consumer Threads ... 7-11
Example 7-3 Thread Function for an Interlaced Producer Consumer Model ... 7-13
Example 7-4 Spin-wait Loop and PAUSE Instructions ...
Figures
Figure 1-3 The Intel NetBurst Microarchitecture ... 1-10
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core ... 1-19
Figure 1-5 The Intel Pentium M Processor Microarchitecture ... 1-27
Figure 1-6 Hyper-Threading Technology on an SMP ... 1-35
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor ...
Figure A-1 Sampling Analysis of Hotspots by Location ... A-10
Figure A-2 Intel Thread Checker Can Locate Data Race Conditions ... A-18
Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded Execution Timelines ... A-20
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus ... B-10
Figure D-1 Stack Frames Based on Alignment Type ...
Tables
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters ... 1-20
Table 1-2 Trigger Threshold and CPUID Signatures for IA-32 Processor Families ... 1-30
Table 1-3 Cache Parameters of Pentium M, Intel Core Solo and Intel Core Duo Processors ... 1-30
Table 1-4 Family And Model Designations of Microarchitectures ...
The target audience for this manual includes software programmers and compiler writers. This manual assumes that the reader is familiar with the basics of the IA-32 architecture and has access to the Intel® Architecture Software Developer’s Manual: Volume 1, Basic Architecture;...
The Intel® VTune™ Performance Analyzer can help you analyze and locate hot-spot regions in your applications. On the Pentium 4, Intel® Xeon and Pentium M processors, this tool can monitor an application through a selection of performance monitoring events and analyze the performance event data that is gathered during code execution.
Chapter 2: General Optimization Guidelines. Describes general code development and optimization techniques that apply to all applications designed to take advantage of the common features of the Intel NetBurst microarchitecture and Pentium M processor microarchitecture. Chapter 3: Coding for SIMD Architectures. Describes techniques and concepts for using the SIMD integer and SIMD floating-point instructions provided by the MMX™...
Appendix A: Application Performance Tools. Introduces tools for analyzing and enhancing application performance without having to write assembly code. Appendix B: Intel Pentium 4 Processor Performance Metrics. Provides information that can be gathered using Pentium 4 processor’s performance monitoring events. These performance metrics can help programmers determine how effectively an application is using the features of the Intel NetBurst microarchitecture.
Related Documentation
For more information on the Intel architecture, specific techniques, and processor architecture terminology referenced in this manual, see the following documents:
• Intel® C++ Compiler User’s Guide
• Intel® Fortran Compiler User’s Guide
• VTune Performance Analyzer online help
• ...
Notational Conventions
This manual uses the following conventions:
This type style: indicates an element of syntax, a reserved word, a keyword, a filename, instruction, computer output, or part of a program example.
Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology enabled chipset, BIOS and operating system. Performance varies depending on the hardware and software used. Dual-core platform requires an Intel Core Duo, Pentium D processor or Pentium processor Extreme Edition, with appropriate chipset, BIOS, and operating system. Performance varies depending on the hardware and software used.
Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and power efficiency that are in addition to those introduced in the Pentium M processor.

SIMD Technology
SIMD computations (see Figure 1-1) were introduced in the IA-32 architecture with MMX technology.
A single SIMD instruction performs the same operation on each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are sorted as a set of four packed data elements.
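For illustration only, the following C sketch expresses the packed operation described above with SSE intrinsics; the function and variable names are assumptions, not part of this manual. One packed instruction adds all four single-precision pairs in parallel:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Adds four corresponding single-precision pairs (X1+Y1 ... X4+Y4)
       with a single packed instruction.                                */
    void packed_add(const float *x, const float *y, float *result)
    {
        __m128 vx = _mm_loadu_ps(x);
        __m128 vy = _mm_loadu_ps(y);
        _mm_storeu_ps(result, _mm_add_ps(vx, vy));
    }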
IA-32 execution modes: protected mode, real address mode, and Virtual 8086 mode. SSE, SSE2, and MMX technologies are architectural extensions in the IA-32 Intel architecture. Existing software will continue to run correctly, without modification on IA-32 microprocessors that incorporate these technologies. Existing software will also run correctly in the presence of applications that incorporate SIMD technologies.
SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding. Streaming SIMD Extensions 2 Streaming SIMD extensions 2 add the following: • 128-bit data type with two packed double-precision floating-point operands • 128-bit data types for SIMD integer operation on 16-byte, 8-word, 4-doubleword, or 2-quadword integers •...
Intel® Extended Memory 64 Technology (Intel® EM64T)
Intel EM64T is an extension of the IA-32 Intel architecture. Intel EM64T increases the linear address space for software to 64 bits and supports physical address space up to 40 bits. The technology also introduces a new operating mode referred to as IA-32e mode.
Intel NetBurst® Microarchitecture
The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, Pentium processor Extreme Edition and the Intel Xeon processor implement the Intel NetBurst microarchitecture. This section describes the features of the Intel NetBurst microarchitecture and its operation common to the above processors.
• to operate at high clock rates and to scale to higher performance and clock rates in the future Design advances of the Intel NetBurst microarchitecture include: • a deeply pipelined design that allows for high clock rates (with different parts of the chip running at different clock rates).
Figure 1-3 illustrates a diagram of the major functional blocks associated with the Intel NetBurst microarchitecture pipeline. The following subsections provide an overview for each. Figure 1-3 The Intel NetBurst Microarchitecture...
The Front End The front end of the Intel NetBurst microarchitecture consists of two parts: • fetch/decode unit • execution trace cache It performs the following functions: • prefetches IA-32 instructions that are likely to be executed • fetches required instructions that have not been prefetched •...
The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch targets are fetched from the execution trace cache if they are cached, otherwise they are fetched from the memory hierarchy.
(BTB). This updates branch history. Figure 1-3 illustrates the paths that are most frequently executed inside the Intel NetBurst microarchitecture: an execution loop that interacts with the multilevel cache hierarchy and the system bus.
Decoder The front end of the Intel NetBurst microarchitecture has a single decoder that decodes instructions at the maximum rate of one instruction per clock. Some complex instructions must enlist the help of the microcode ROM.
It enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction. For Pentium 4 and Intel Xeon processors, the branch delay for a correctly predicted instruction can be as few as zero clock cycles.
To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also: “Branch Prediction” in Chapter 2). Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome even before the branch instruction is decoded.
Appendix C, “IA-32 Instruction Latency and Throughput,” lists some of the more-commonly-used IA-32 instructions with their latency, their issue throughput, and associated execution units (where relevant). Some execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle). The number of µops associated with each instruction provides a basis for selecting instructions to generate. All µops executed out of the microcode ROM involve extra overhead.
MMX_MISC handles SIMD reciprocal and some integer operations.

Caches
The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least two levels of on-chip cache are implemented in processors based on the Intel NetBurst microarchitecture. The Intel Xeon processor MP and selected Pentium and Intel Xeon processors may also contain a third-level cache.
Each read due to a cache miss fetches a sector, consisting of two adjacent cache lines; a write operation is 64 bytes. Pentium 4 and Intel Xeon processors with CPUID model encoding value of 2 have a second level cache of 512 KB.
This approach has the following effect:
• minimizes disturbance of temporal data in other cache levels
• avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache levels
Situations that are less likely to benefit from software prefetch are:
• for cases that are already bandwidth bound, prefetching tends to increase bandwidth demands
(stride that is greater than the trigger threshold distance), this can achieve the additional benefit of improved temporal locality and significantly reduced cache misses in the last level cache.
Thus, software optimization of a data access pattern should emphasize tuning for hardware prefetch first, to favor greater proportions of smaller-stride data accesses in the workload, before attempting to provide hints to the processor by employing software prefetch instructions.

Loads and Stores
The Pentium 4 processor employs the following techniques to speed up the execution of memory operations:
• Alignment: the store cannot wrap around a cache line boundary, and the linear address of the load must be the same as that of the store
Intel® Pentium® M Processor Microarchitecture
Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M processor microarchitecture contains three sections:
• in-order issue front end
• out-of-order superscalar execution core
• in-order retirement unit
Intel Pentium M processor microarchitecture supports a high-speed system bus (up to 533 MHz) with 64-byte line size.
The microarchitecture of the Pentium M processor is shown in Figure 1-5.
Figure 1-5 The Intel Pentium M Processor Microarchitecture

The Front End
The Intel Pentium M processor uses a pipeline depth that enables high performance and low power consumption. It is shorter than that of the Intel NetBurst microarchitecture.
The branch prediction hardware includes dynamic prediction, and branch target buffers. The Intel Pentium M processor has enhanced dynamic branch prediction hardware. Branch target buffers (BTB) predict the direction and target of branches based on an instruction’s address.
MMX technology loads and for most kinds of successive execution operations. Note that SSE loads cannot be fused.

Data Prefetching
The Intel Pentium M processor supports three prefetching mechanisms:
• The first mechanism is a hardware instruction fetcher and is described in the previous section.
Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries. See Table 1-3 for processor cache parameters.
Table 1-3 Cache Parameters of Pentium M, Intel Core Solo and Intel Core Duo Processors
Duo processor to minimize bus traffic between two cores accessing a single-copy of cached data. It allows an Intel Core Solo processor (or when one of the two cores in an Intel Core Duo processor is idle) to access its full capacity.
Pentium M processor (see Table 1-2).

Front End
Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors is improved over Pentium M processors by the following enhancements:
Intel® Hyper-Threading (HT) Technology is supported by specific members of the Intel Pentium 4 and Xeon processor families. The technology enables software to take advantage of task-level, or thread-level parallelism by providing multiple logical processors within a physical processor package. In its first implementation in the Intel Xeon processor, Hyper-Threading Technology makes a single physical processor appear as two logical processors.
The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.
(MTRRs) and the performance monitoring resources. For a complete list of the architecture state and exceptions, see the IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A & 3B. Other resources such as instruction pointers and register renaming tables were replicated to simultaneously track execution and state changes of the two logical processors.
• Shared mode: The L1 data cache is fully shared by two logical processors.
• Adaptive mode: In adaptive mode, memory accesses using the page directory are mapped identically across logical processors sharing the L1 data cache.
The other resources are fully shared.
Microarchitecture Pipeline and Hyper-Threading Technology This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline. Although instructions originating from two programs or two threads execute simultaneously and not necessarily in program order in the execution core and memory hierarchy, the front end and back end contain several selection points to select between instructions from the...
The Intel Pentium D processor provides two logical processors in a physical package; each logical processor has a separate execution core and a cache hierarchy. The Dual-Core Intel Xeon processor and the Intel Pentium processor Extreme Edition support Hyper-Threading Technology: each core provides two logical processors sharing an execution core and a cache hierarchy. The Intel Core Duo processor provides two logical processors in a physical package. Each logical processor has a separate execution core (including first-level cache) and a smart second-level cache.
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor (block diagrams showing, for each processor, the per-core architectural state, execution engine, local APIC, caches, and bus interface connected to the system bus)
The Intel Core Duo processor has two symmetric cores that share the second-level cache and a single bus interface (see Figure 1-7). Two threads executing on two cores in an Intel Core Duo processor can take advantage of shared second-level cache, accessing a single-copy of cached data without generating bus traffic.
Table 1-5 lists the performance characteristics of generic load and store operations in an Intel Core Duo processor, including the cases where data resides in the second-level cache and the first-level cache of the other core, or in memory. Numeric values in Table 1-5 are in terms of processor core cycles.
When data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses that require the eviction of modified lines and are within a short time, there is an overall degradation in response time of these cache misses.
Intel compilers, which are tuned for the IA-32 processor family, provide most of these optimizations. For those not using the Intel C++ or Fortran Compiler, the assembly code tuning optimizations may be useful. The explanations are supported by coding examples.
Tuning to Prevent Known Coding Pitfalls
To produce program code that takes advantage of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture, you must avoid the coding pitfalls that limit the performance of the target processor family.
“Tuning to Achieve Optimum Performance” section. It also highlights practices that use performance tools. The majority of these guidelines benefit processors based on the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. Some guidelines benefit one microarchitecture more than the other.
— Set this compiler to produce code for the target processor implementation — Use the compiler switches for optimization and/or profile-guided optimization. These features are summarized in the “Intel® C++ Compiler” section. For more detail, see the Intel® C++ Compiler User’s Guide. • Current-generation performance monitoring tools, such as VTune™...
Optimize Branch Predictability • Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumption: backward taken and forward not taken. • Avoid mixing near calls, far calls and returns. • Avoid implementing a call by pushing the return address and jumping to the target.
• Minimize use of global variables and pointers. • Use the const variables. • Use new cacheability instructions and memory-ordering behavior. Optimize Floating-point Performance • Avoid exceeding representable ranges during computation, since handling these cases can have a performance impact. Do not use a larger precision format (double-extended floating point) unless required, since this increases memory size and bandwidth utilization.
• Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e.g., use shifts instead of multiplies).
• Use the lea instruction for address calculation.
• Some types of stores use more µops than others; try to use simpler store variants and/or reduce the number of stores.
• Avoid the use of conditionals. • Keep induction (loop) variable expressions simple. • Avoid using pointers, try to replace pointers with arrays and indices. Coding Rules, Suggestions and Tuning Hints This chapter includes rules, suggestions and hints. They are maintained in separately-numbered lists and are targeted for engineers who are: •...
Refer to the “Intel C++ Intrinsics Reference” section of the Intel® C++ Compiler User’s Guide. • C++ class libraries. Refer to the “Intel C++ Class Libraries for SIMD Operations Reference” section of the Intel® C++ Compiler User’s Guide. •...
However, if particular performance problems are noted with the compiled code, some compilers (like the Intel C++ and Fortran Compilers) allow the coder to insert intrinsics or inline assembly in order to exert greater control over what code is generated.
Processor Perspectives The majority of the coding recommendations for the Pentium 4 and Intel Xeon processors also apply to Pentium M, Intel Core Solo, and Intel Core Duo processors. However, there are situations where a recommendation may benefit one microarchitecture more than the other.
CPUID signature family 6, model 9). On Pentium 4, Intel Xeon processors, Pentium M processor (with CPUID signature family 6, model 13), and Intel Core Solo, and Intel Core Duo processors, such penalties are resolved by artificial dependencies between each partial register write.
• On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cache. On Pentium M processors, code size limit is governed by the instruction cache. • There may be a penalty when instructions with immediates requiring more than 16-bit signed representation are placed next to other instructions that use immediates.
IA-32 processor families. See CPUID instruction in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2B. For coding techniques that rely on specific parameters of a cache level,...
Branch Prediction Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving the predictability of branches, you can increase the speed of code significantly. Optimizations that help branch prediction are: • Keep code and data on separate pages (a very important item, see more details in the “Memory Accesses”...
Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches. For the Pentium M processor, every branch counts; even correctly predicted branches have a negative effect on the amount of useful code delivered to the processor.
Example 2-1 Assembly Code with an Unpredictable Branch
    cmp   A, B             ; condition
    jge   L30              ; conditional branch
    mov   ebx, CONST1
    jmp   L31              ; unconditional branch
L30:
    mov   ebx, CONST2
L31:

Example 2-2 Code Optimization to Eliminate Branches
    xor   ebx, ebx         ; clear ebx
    cmp   A, B
    setge bl               ; ebx = 1 if A >= B, else 0
    sub   ebx, 1           ; ebx = 0 or -1 (all ones)
    and   ebx, CONST3      ; CONST3 = CONST1 - CONST2
    add   ebx, CONST2      ; ebx = CONST1 or CONST2

See Example 2-2. The optimized code first clears ebx and then compares A and B; setge and the arithmetic that follows compute the result without a branch.
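The same transformation can be sketched at the C level; the function name and constant values below are illustrative, not taken from this manual:

    #define CONST1 10       /* illustrative values */
    #define CONST2 20

    /* Returns CONST1 when a < b, CONST2 otherwise, without a branch;
       equivalent to (a < b) ? CONST1 : CONST2.                        */
    int select_const(int a, int b)
    {
        int mask = -(int)(a < b);   /* all ones if a < b, else 0 */
        return (mask & (CONST1 - CONST2)) + CONST2;
    }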
Pentium processors and earlier 32-bit Intel architecture processors. Be sure to check whether a processor supports these instructions with the cpuid instruction.

Spin-Wait and Idle Loops
The Pentium 4 processor introduces a new pause instruction.
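A minimal sketch of a spin-wait loop that issues the pause hint through a compiler intrinsic; this assumes a compiler that provides _mm_pause, and the flag variable is illustrative:

    #include <emmintrin.h>      /* _mm_pause */

    /* Spin until another thread sets *sync to 1. The pause hint tells the
       processor this is a spin-wait loop, reducing the loop-exit penalty
       and power consumption while waiting.                               */
    static void spin_wait(volatile int *sync)
    {
        while (*sync != 1)
            _mm_pause();
    }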
Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted using a static prediction algorithm. The Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms: •...
Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.
Examples 2-6 and 2-7 provide basic rules for the static prediction algorithm. In Example 2-6, the backward branch is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction will not occur.
Inlining, Calls and Returns The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may be degraded.
Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache. Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession;...
Placing data immediately following an indirect branch can cause a performance problem. If the data consist of all zeros, it looks like a long stream of adds to memory destinations, which can cause resource conflicts and slow down branch recovery. Also, the data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages.
Convert the indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history. The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of branches, even at the expense of adding more branches.
best performance from a coding effort. An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 2-9.

Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction
    function ()
    {
        int n = rand();    /* random value */
        ...
• The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known, unroll inner loops until they have a maximum of 16 iterations.
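As a sketch of the unrolling guideline above (the function and array names are illustrative), a 64-iteration inner loop can be unrolled by four so that its exit branch executes only 16 times:

    /* Before unrolling: 64 iterations. After unrolling by 4: 16
       iterations, which keeps the exit branch within the limit
       described above.                                           */
    float sum_64(const float *a)
    {
        float sum = 0.0f;
        int i;
        for (i = 0; i < 64; i += 4)
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        return sum;
    }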
Compiler Support for Branch Prediction Compilers can generate code that improves the efficiency of branch prediction in the Pentium 4 and Pentium M processors. The Intel C++ Compiler accomplishes this by: • keeping code and data on separate pages •...
Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4, Intel Xeon, and Pentium M processors. On the Pentium 4 processor, an access to data unaligned on 64-byte boundary leads to two memory accesses and requires several µops to be...
Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries. For best performance, align data as follows:
Example 2-11 Code That Causes Cache Line Split
    mov esi, 029e70feh
    mov edi, 05be5260h
Blockmove:
    mov eax, DWORD PTR [esi]
    mov ebx, DWORD PTR [esi+4]
    mov DWORD PTR [edi], eax
    mov DWORD PTR [edi+4], ebx
    add esi, 8
    add edi, 8
    sub edx, 1
    jnz Blockmove

Figure 2-1 Cache Line Split in Accessing Elements in an Array (an access at address 029e70c1h spans the cache lines starting at 029e70c0h and 029e7100h)
Store Forwarding The processor’s memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency. There are two kinds of requirements for store forwarding.
Pentium M processors than that for Pentium 4 processors. This section describes these restrictions in all cases. It prescribes recommendations to prevent the non-forwarding penalty. Fixing this problem for Pentium 4 and Intel Xeon processors also fixes the problem on Pentium M processors.
The size and alignment restrictions for store forwarding are illustrated in Figure 2-2.
Figure 2-2 Size and Alignment Restrictions in Store Forwarding (panels: (a) small load after large store; (b) size of load >= size of store; (c) size of load >= size(s) of store(s); (d) 128-bit forward must be 16-byte aligned)
A load that forwards from a store must wait for the store’s data to be written to the store buffer before proceeding, but other, unrelated loads need not wait. Assembly/Compiler Coding Rule 20. (H impact, ML generality) If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary.
Example 2-13 A Non-forwarding Example of Large Load After Small Store
    mov [EBP],     ‘a’
    mov [EBP + 1], ‘b’
    mov [EBP + 2], ‘c’
    mov [EBP + 3], ‘d’
    mov EAX, [EBP]
    ; The first 4 small stores can be consolidated into
    ; a single DWORD store to prevent this non-forwarding situation
When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are more efficient (if aligned) and can be used to avoid unaligned loads. Although floating-point registers allow the movement of 64 bits at a time, floating point instructions should not be used for this purpose, as data may be inadvertently modified.
However, the overall impact of this problem is much smaller than that from size and alignment requirement violations. The Pentium 4 and Intel Xeon processors predict when loads are both dependent on and get their data forwarded from preceding stores. These predictions can significantly improve performance.
An example of a loop-carried dependence chain is shown in Example 2-17.

Example 2-17 An Example of Loop-carried Dependence Chain
    for (i = 0; i < MAX; i++) {
        a[i] = b[i] * foo;
        foo = a[i] / 3;
    }

Data Layout Optimizations
User/Source Coding Rule 2. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.
Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multimedia). These reference and use data only once before discarding it. Data accesses which sparsely utilize the data within a cache line can result in less efficient utilization of system memory bandwidth.
However, if the access pattern of the array exhibits locality, such as if the array index is being swept through, then the Pentium 4 processor prefetches data from struct_of_array effectively even when the elements of the structure are accessed together. When the elements of the structure are not accessed with equal frequency, such as when one element is accessed much more often than the other entries, struct_of_array not only saves memory but also prevents fetching unnecessary data items.
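For illustration (the type and field names below are hypothetical, not from this manual), the two layouts discussed above can be declared as:

    #define NUM 1024

    /* Array of structures (AoS): the fields of one element are adjacent. */
    typedef struct {
        float x, y, z, w;
    } VertexAoS;
    VertexAoS vertices_aos[NUM];

    /* Structure of arrays (SoA): all values of one field are adjacent,
       giving unit-stride accesses when only some fields are needed.    */
    typedef struct {
        float x[NUM];
        float y[NUM];
        float z[NUM];
        float w[NUM];
    } VertexSoA;
    VertexSoA vertices_soa;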
User/Source Coding Rule 3. (M impact, L generality) Beware of false sharing within a cache line (64 bytes) for Pentium 4, Intel Xeon, and Pentium M processors; and within a sector of 128 bytes on Pentium 4 and Intel Xeon processors.
Note that first-level cache lines are 64 bytes. Thus the least significant 6 bits are not considered in alias comparisons. For the Pentium 4 and Intel Xeon processors, data is loaded into the second level cache in a sector of 128 bytes, so the least significant 7 bits are not considered in alias comparisons.
Aliasing Cases in the Pentium Processors Aliasing conditions that are specific to the Pentium 4 processor and Intel Xeon processor are: • 16K for code – there can only be one of these in the trace cache at a time. If two traces whose starting addresses are 16K apart are in the same working set, the symptom will be a high trace cache miss rate.
Aliasing Cases in the Pentium M Processor Pentium M, Intel Core Solo and Intel Core Duo processors have the following aliasing case: • Store forwarding - If there has been a store to an address followed by a load to the same address within a short time window, the load will not proceed until the store data is available.
1 KB subpages.

Self-modifying Code
Self-modifying code (SMC) that ran correctly on Pentium III processors and prior implementations will run correctly on subsequent implementations, including Pentium 4 and Intel Xeon processors.
Saving traffic is particularly important for avoiding partial writes to uncached memory. There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with CPUID signature of family encoding 15, model encoding 3, there are 8 write-combining buffers). Two of these buffers...
can be freed for reuse on other write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC. Assembly/Compiler Coding Rule 28. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only four arrays are being written to in each iteration of each of the resulting loops.
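A sketch of the loop-fission rule above; the array names and the all-zero stores are purely illustrative. Splitting one loop that writes six arrays into two loops, each writing at most four arrays, keeps each loop within the guaranteed number of write-combining buffers:

    void fill_arrays(int n, float *a, float *b, float *c,
                     float *d, float *e, float *f)
    {
        int i;
        /* First loop writes four distinct arrays per iteration. */
        for (i = 0; i < n; i++) {
            a[i] = 0.0f; b[i] = 0.0f; c[i] = 0.0f; d[i] = 0.0f;
        }
        /* Second loop writes the remaining two arrays. */
        for (i = 0; i < n; i++) {
            e[i] = 0.0f; f[i] = 0.0f;
        }
    }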
RFO since the line is not cached, and there is no such delay. For details on write-combining, see the Intel Architecture Software Developer’s Manual.
Locality enhancement to the last level cache can be accomplished by sequencing the data access pattern to take advantage of hardware prefetching. This can take several forms:
• Transformation of a sparsely populated multi-dimensional array into a one-dimensional array, such that memory references occur in a sequential, small-stride pattern that the hardware prefetcher can follow.
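Another common form is loop blocking (tiling). The sketch below processes a large 2-D array in tiles so that the working set of the inner loops stays within a cache level; the array size and block size are illustrative choices, not values from this manual:

    #define N     1024
    #define BLOCK 64

    /* Blocked transpose: each BLOCK x BLOCK tile is reused while it is
       still resident in the cache, improving temporal locality.        */
    void transpose_blocked(float dst[N][N], const float src[N][N])
    {
        int ii, jj, i, j;
        for (ii = 0; ii < N; ii += BLOCK)
            for (jj = 0; jj < N; jj += BLOCK)
                for (i = ii; i < ii + BLOCK; i++)
                    for (j = jj; j < jj + BLOCK; j++)
                        dst[j][i] = src[i][j];
    }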
Minimizing Bus Latency
The system bus on Intel Xeon and Pentium 4 processors provides up to 6.4 GB/sec of throughput at a 200 MHz scalable bus clock rate. (See the MSR_EBC_FREQUENCY_ID register.) The peak bus bandwidth is even higher with higher bus clock rates.
User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should pay attention to favor data access patterns that result in higher concentrations of cache miss patterns with cache miss strides that are significantly smaller than half of the hardware prefetch trigger threshold.
64-bytes into the first-level data cache without polluting the second-level cache. Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data relative to those on the Pentium M processors. The key differences are summarized in Table 1-2.
Tuning data access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consideration than using software prefetch instructions. The hardware prefetcher is best for small-stride data access patterns in either direction with a cache-miss stride not far from 64 bytes. This is true for data accesses to addresses that are either known or unknown at the time of issuing the load operations.
Because the trace cache (TC) removes the decoding stage from the pipeline for frequently executed code, optimizing code alignment for decoding is not as important for Pentium 4 and Intel Xeon processors. For the Pentium M processor, code alignment and the alignment of branch target will affect the throughput of the decoder.
Guidelines for Optimizing Floating-point Code User/Source Coding Rule 10. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches. Follow this procedure to investigate the performance of your floating-point application: • Understand how the compiler handles floating-point code. •...
to early out). However, be careful of introducing more than a total of two values for the floating point control word, or there will be a large performance penalty. See “Floating-point Modes”. User/Source Coding Rule 13. (H impact, ML generality) Use fast float-to-int routines, FISTTP, or SSE2 instructions.
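Where SSE2 is available, a fast float-to-int conversion can be expressed with the truncating convert instruction, which does not depend on the x87 rounding mode. A minimal sketch using intrinsics; the function name is illustrative:

    #include <emmintrin.h>      /* SSE2 intrinsics */

    /* Truncating double-to-int conversion (cvttsd2si): no change to
       the FPU control word is needed.                                */
    static int double_to_int(double x)
    {
        return _mm_cvttsd_si32(_mm_set_sd(x));
    }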
• arithmetic underflow • denormalized operand Refer to Chapter 4 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 1 for the definition of overflow, underflow and denormal exceptions. Denormalized floating-point numbers impact performance in two ways: •...
executing SSE/SSE2/SSE3 instructions and when speed is more important than complying with the IEEE standard. The following paragraphs give recommendations on how to optimize your code to reduce performance degradations related to floating-point exceptions.

Dealing with floating-point exceptions in x87 FPU code
Every special situation listed in the “Floating-point Exceptions”...
Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification. If a programmer is willing to trade pure IEEE 754 compliance for speed, two non-IEEE 754 compliant modes are provided to speed up situations where underflows and denormalized inputs are frequent: FTZ mode and DAZ mode.
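For SSE/SSE2/SSE3 code, both modes are controlled through the MXCSR register. A sketch using intrinsics, assuming headers that provide the FTZ and DAZ macros (DAZ is not available on all processors):

    #include <xmmintrin.h>      /* _MM_SET_FLUSH_ZERO_MODE     */
    #include <pmmintrin.h>      /* _MM_SET_DENORMALS_ZERO_MODE */

    /* Trade IEEE 754 compliance for speed when denormals are frequent. */
    static void enable_ftz_daz(void)
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }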
FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel Core Duo processors, FLDCW is improved over previous generations. Specifically, the FLDCW optimization allows software to alternate between two constant values of the FCW efficiently.
Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating point control word. Changes for more than two values (each value being a combination of the following bits: precision, rounding and infinity control, and the rest of bits in FCW) leads to delays that are on the order of the pipeline depth.
If there is more than one change to rounding, precision and infinity bits and the rounding mode is not important to the result, use the algorithm in Example 2-23 to avoid synchronization issues, the overhead of the fldcw instruction and having to change the rounding mode.
Example 2-23 Algorithm to Avoid Changing the Rounding Mode
_fto132proc
    lea     ecx, [esp-8]
    sub     esp, 16                 ; allocate frame
    and     ecx, -8                 ; align pointer on boundary of 8
    fld     st(0)                   ; duplicate FPU stack top
    fistp   qword ptr[ecx]
    fild    qword ptr[ecx]
    mov     edx, [ecx+4]            ; high dword of integer
    mov     eax, [ecx]              ; low dword of integer
    test    eax, eax
    je      integer_QnaN_or_zero

arg_is_not_integer_QnaN:
    fsubp   st(1), st               ; TOS = d - round(d)
    test    edx, edx                ; what is the sign of the integer?
    jns     positive                ; number is negative
    fstp    dword ptr[ecx]          ; result of subtraction
    mov     ecx, [ecx]              ; dword of diff (single precision)
    add     esp, 16
    xor     ecx, 80000000h
    add     ecx, 7fffffffh          ; if diff < 0 then decrement integer
    adc     eax, 0                  ; inc eax (add CARRY flag)
    ret

positive:
    fstp    dword ptr[ecx]          ; result of subtraction
    mov     ecx, [ecx]              ; dword of diff (single precision)
    add     esp, 16
    add     ecx, 7fffffffh          ; if diff < 0 then decrement integer
    sbb     eax, 0                  ; dec eax (subtract CARRY flag)
    ret

integer_QnaN_or_zero:
    test    edx, 7fffffffh
    jnz     arg_is_not_integer_QnaN
    add     esp, 16
    ret

Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode.

Improving Parallelism and the Use of FXCH
The x87 instruction set relies on the floating point stack for one of its operands. If the dependence graph is a tree, which means each intermediate result is used only once and code is scheduled carefully, it is often possible to use only operands that are on the top of the stack or in memory, and to avoid using operands that are buried under the top of the stack.
This in turn allows instructions to be reordered to make instructions available to be executed in parallel. Out-of-order execution precludes the need for using fxch.

x87 vs. Scalar SIMD Floating-point Trade-offs
There are a number of differences between x87 floating-point code and scalar floating-point code (using SSE and SSE2).
Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors On Intel Core Solo and Intel Core Duo processors, the combination of improved decoding and micro-op fusion allows instructions which were formerly two, three, and four micro-ops to go through all decoders. As a result, scalar SSE/SSE2 code can match the performance of x87 code executing through two floating-point units.
On Pentium M, Intel Core Solo and Intel Core Duo processors, this penalty can be avoided by using movlpd. However, using movlpd causes a performance penalty on Pentium 4 processors.
Floating-Point Stalls
Floating-point instructions have a latency of at least two cycles. But, because of the out-of-order nature of Pentium II and the subsequent processors, stalls will not necessarily occur on an instruction or µop basis. However, if an instruction has a very long latency, such as an fdiv, then scheduling can improve the throughput of the overall application.
Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or Streaming SIMD Extensions 2. Instruction Selection This section explains how to generate optimal assembly code. The listed optimizations have been shown to contribute to the overall performance at the application level on the order of 5%.
Complex Instructions
Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions (those with more than four µops that require multiple cycles to decode). Use sequences of simple instructions instead. Complex instructions may save architectural registers, but incur a penalty of 4 µops to set up parameters for the microcode ROM.
Use of the inc and dec Instructions
The inc and dec instructions update only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.
Operand Sizes and Partial Register Accesses
The Pentium 4 processor, Pentium M processor (with CPUID signature family 6, model 13), Intel Core Solo and Intel Core Duo processors do not incur a penalty for partial register accesses; the Pentium M processor
(model 9) does incur a penalty. This is because every operation on a partial register updates the whole register. However, this does mean that there may be false dependencies between any references to partial registers. Example 2-24 demonstrates a series of false and real dependencies caused by referencing partial registers.
Table 2-3 illustrates packing three byte values into a register.

Table 2-3 Avoiding Partial Register Stall When Packing Byte Values
A Sequence with Partial Register Stall:
    mov  al, byte ptr a[2]
    shl  eax, 16
    mov  ax, word ptr a
    movd mm0, eax

Assembly/Compiler Coding Rule 44. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.
less delay than the partial register update problem mentioned above, but the performance gain may vary. If the additional µop is a critical problem, movsx can sometimes be used as an alternative. Sometimes sign-extended semantics can be maintained by zero-extending operands. For example, the C code in the following statements does not need sign extension, nor does it need prefixes for operand size overrides:
    static short int a, b;
FF). Use of an LCP causes a change in the number of bytes needed to encode the displacement operand in the instruction. On Pentium M, Intel Core Solo and Intel Core Duo processors, the following situations cause extra delays when decoding an instruction with an LCP:
• Processing an instruction with the 0x66 prefix that (i) has a modr/m byte in its encoding and (ii) has its opcode byte aligned on byte 14 of an instruction fetch line. The performance delay in this case is approximately twice that of the other two situations.
String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doublewords plus single byte moves with a count value less or equal to 3. Because software can use SIMD data movement instructions to move 16 bytes at a time, the following paragraphs discuss general guidelines for designing and implementing high-performance library functions such as...
For cases where N is less than a small count (the threshold will vary between microarchitectures; empirically, 8 may be a good value when optimizing for the Intel NetBurst microarchitecture), each case can be coded directly without the overhead of a looping structure.
To improve address alignment, a small piece of prolog code using movsb/stosb with a count less than 4 can be used to peel off the non-aligned data moves before starting to use movsd/stosd.
• For cases where N is less than half the size of the last level cache, throughput consideration may favor either: (a) an approach using REP string with the largest data granularity, because REP string has little overhead for loop iteration, and the branch misprediction...
    for (i = 0; i < size; i++)
        *d++ = (char)c;
Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignments, counter values, and microarchitectures. In most cases, applications should take advantage of the default memory routines provided by Intel Compilers.
In some situations, the byte count of the data to operate is known by the context (versus from a parameter passed from a call). One can take a simpler approach than those required for a general-purpose library routine. For example, if the byte count is also small, using rep movsb/stosb with count less than four can ensure good address alignment and loop-unrolling to finish the remaining data;...
The xorps and xorpd instructions cannot be used to break dependence chains. In Intel Core Solo and Intel Core Duo processors, the xorps, xorpd, and pxor instructions can be used to clear execution dependencies on the zero evaluation of the destination register.
Often a produced value must be compared with zero, and then used in a branch. Because most Intel architecture instructions set the condition codes as part of their execution, the compare instruction may be eliminated.
use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the µops for movapd use a different execution port and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the dependence or the execution port.
Prolog Sequences
Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame pointer and that do not have called routines that modify ESP, use ESP as the base register to free up EBP. This optimization does not apply in the following cases: a routine is called that leaves ESP modified upon return, for example, structured or C++ style exception handling;
Example 2-25 Recombining LOAD/OP Code into REG,MEM Form
    LOAD reg1, mem1
    ... code that does not write to reg1 ...
    OP   reg2, reg1
    ... code that does not use reg1 ...

Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache packing more difficult. On the Pentium 4 processor, the sequence of loading a value from memory into a register and adding the results in a register to memory is faster than the alternate sequence of adding a value from memory to a
Scheduling Rules for the Pentium 4 Processor Decoder The Pentium 4 and Intel Xeon processors have a single decoder that can decode instructions at the maximum rate of one instruction per clock.
Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required. Scheduling Rules for the Pentium M Processor Decoder The Pentium M processor has three decoders, but the decoding rules to supply micro-ops at high bandwidth are less stringent than those of the Pentium III processor.
Extensions 2. Thus the vector length ranges from 2 to 16, depending on the instruction extensions used and on the data type. The Intel C++ Compiler supports vectorization in three ways:
• The compiler may be able to generate SIMD code without intervention from the user (a loop that qualifies for this is sketched below).
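A minimal example of a loop the compiler can vectorize automatically; the function and array names are illustrative. The loop uses unit-stride accesses, has no loop-carried dependence, and declares its pointers restrict so the compiler can prove the arrays do not overlap:

    void add_arrays(float *restrict c, const float *restrict a,
                    const float *restrict b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }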
User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches. User/Source Coding Rule 20. (M impact, ML generality) Keep induction (loop) variables expressions simple. Miscellaneous This section explains separate guidelines that do not belong to any category described above.
The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware. Therefore, a code generator should arrange to use the register containing the oldest value as input, so that the NOP will dispatch and release RS resources at the earliest possible opportunity.
User/Source Coding Rule 3. (M impact, L generality) Beware of false sharing within a cache line (64 bytes) for both Pentium 4, Intel Xeon, and Pentium M processors; and within a sector of 128 bytes on Pentium 4 and Intel Xeon processors. 2-42 User/Source Coding Rule 4.
User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should pay attention to favor data access patterns that result in higher concentrations of cache miss patterns with cache miss strides that are significantly smaller than half of the hardware prefetch trigger threshold.
look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the Streaming SIMD Extensions and the Streaming SIMD Extensions 2 instructions.
out-of-order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch. 2-16 Assembly/Compiler Coding Rule 3.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks. 2-22 Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk. 2-22 Assembly/Compiler Coding Rule 12.
Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data. 2-34 Assembly/Compiler Coding Rule 19. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.
first-level cache working set. Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the same second-level cache working set. Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KB.
Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with add and sub instructions, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependencies on earlier instructions that set the flags. 2-73 Assembly/Compiler Coding Rule 43. (ML impact, L generality) Avoid rotate by register or rotate...
instead of a zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register. 2-79 Assembly/Compiler Coding Rule 51. (ML impact, M generality) Eliminate unnecessary compare with zero instructions by using the appropriate conditional jump instruction when the flags are already set by a preceding arithmetic instruction.
Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their source operand in memory and the destination operand is in a register, attempt a strategy that initially loads the memory operand to a register followed by a register to register ALU operation.
Tuning Suggestions Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on a code page as instructions. The only condition where this is likely to happen is following an indirect branch that is not resident in the trace cache. If a performance problem is clearly due to this problem, try moving the data elsewhere, or inserting an illegal opcode or instruction immediately following the indirect branch.
Coding for SIMD Architectures Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions technology (SSE), and MMX technology. In addition, Streaming SIMD Extensions 3 (SSE3) were introduced with the Pentium 4 processor supporting Hyper-Threading Technology at 90 nm technology.
Checking for Processor Support of SIMD Technologies This section shows how to check whether a processor supports MMX technology, SSE, SSE2, or SSE3. SIMD technology can be included in your application in three ways: Check for the SIMD technology during installation. If the desired SIMD technology is available, the appropriate DLLs can be installed.
Example 3-2 shows how to find the SSE feature bit (bit 25) in the CPUID feature flags: the code identifies a GenuineIntel signature, requests the feature flags with the cpuid instruction (opcode bytes 0Fh, 0A2h), and tests the feature bit of interest.
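The following is a minimal C sketch of the same detection logic, not the manual's example; it assumes a GCC-style compiler on an IA-32 target and uses the architectural feature-flag bit positions returned by CPUID leaf 1 (MMX: EDX bit 23, SSE: EDX bit 25, SSE2: EDX bit 26, SSE3: ECX bit 0).

#include <stdio.h>

/* Sketch: execute cpuid leaf 1 and report the SIMD feature bits.
   A production check would first verify the vendor string and the
   maximum supported leaf. */
static void cpuid1(unsigned *eax, unsigned *ebx, unsigned *ecx, unsigned *edx)
{
    __asm__ __volatile__("cpuid"
                         : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                         : "a"(1), "c"(0));
}

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    cpuid1(&eax, &ebx, &ecx, &edx);
    printf("MMX  : %s\n", (edx >> 23) & 1 ? "yes" : "no");
    printf("SSE  : %s\n", (edx >> 25) & 1 ? "yes" : "no");
    printf("SSE2 : %s\n", (edx >> 26) & 1 ? "yes" : "no");
    printf("SSE3 : %s\n", (ecx      ) & 1 ? "yes" : "no");
    return 0;
}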
bool OSSupportCheck() {
  _try {
    __asm xorps xmm0, xmm0    ;Streaming SIMD Extension
  }
  _except(EXCEPTION_EXECUTE_HANDLER) {
    if (_exception_code()==STATUS_ILLEGAL_INSTRUCTION)
      /* SSE not supported by OS */
      return (false);
  }
  /* SSE supported by OS */
  return (true);
}
Identifying support for SSE2 follows the same pattern: the code identifies a GenuineIntel signature, requests the feature flags with the cpuid instruction, and checks bit 26 of the feature flags for SSE2 existence; operating system support is then verified by executing an SSE2 instruction and trapping for an exception. See Example 3-5.
Example 3-5 Identification of SSE2 by the OS
bool OSSupportCheck() {
  _try {
    __asm xorpd xmm0, xmm0    ; SSE2
  }
  _except(EXCEPTION_EXECUTE_HANDLER) {
    if (_exception_code()==STATUS_ILLEGAL_INSTRUCTION)
      /* SSE2 not supported by OS */
      return (false);
  }
  /* SSE2 supported by OS */
  return (true);
}
Checking for Streaming SIMD Extensions 3 Support SSE3 includes 13 instructions, 11 of those are suited for SIMD or x87 style programming.
Checking for MONITOR and MWAIT support can be done by executing the MONITOR instruction and trapping for an exception, similar to the sequence shown in Example 3-7; alternatively, the MONITOR/MWAIT feature flag returned by the cpuid instruction can be examined after identifying a GenuineIntel signature and requesting the feature flags.
Example 3-7 Identification of SSE3 by the OS
bool SSE3_SIMD_SupportCheck() {
  _try {
    __asm addsubpd xmm0, xmm0    ; SSE3
  }
  _except(EXCEPTION_EXECUTE_HANDLER) {
    if (_exception_code()==STATUS_ILLEGAL_INSTRUCTION)
      /* SSE3 not supported by OS */
      return (false);
  }
  /* SSE3 SIMD and FISTTP instructions are supported */
  return (true);
}
Considerations for Code Conversion to SIMD Programming The VTune Performance Enhancement Environment CD provides tools to aid in the evaluation and tuning.
Figure 3-1 Converting to Streaming SIMD Extensions Chart (flowchart: identify hot spots in the code; determine whether the code benefits from SIMD and whether it is integer or floating-point; for floating-point, determine whether range or precision requirements allow conversion to integer or to single precision; then change the code to use SIMD integer or single-precision SIMD accordingly).
To use any of the SIMD technologies optimally, you must evaluate the following situations in your code: • fragments that are computationally intensive • fragments that are executed often enough to have an impact on performance • fragments that require little data-dependent control flow •
Since the VTune analyzer is designed specifically for all of the Intel architecture (IA)-based processors, including the Pentium 4 processor, it can offer these detailed approaches to working with IA. See “Code Optimization Options”...
XMM registers). • Re-code the loop with the SIMD instructions. Each of these actions is discussed in detail in the subsequent sections of this chapter. These sections also discuss enabling automatic vectorization via the Intel C++ Compiler. 3-12...
Coding Methodologies Software developers need to compare the performance improvement that can be obtained from assembly code versus the cost of those improvements. Programming directly in assembly language for a target platform may produce the required performance gain, however, assembly code is not portable between processor architectures and is expensive to write and maintain.
The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and integer data under SSE2, SSE, and MMX technology. As a basis for the usage model discussed in this section, consider a simple loop shown in Example 3-8.
XMMWORD PTR [ecx], xmm0 Intrinsics Intrinsics provide access to the ISA functionality using C/C++ style coding instead of assembly language. Intel has defined three sets of intrinsic functions that are implemented in the Intel C++ Compiler to support the MMX technology, Streaming SIMD Extensions and Streaming SIMD Extensions 2.
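As a hedged illustration of the intrinsics style (not the manual's numbered example), the simple element-wise add loop discussed above could be written with SSE intrinsics as follows; the function and array names are illustrative, and the arrays are assumed to be 16-byte aligned with a length that is a multiple of four.

#include <xmmintrin.h>

/* Sketch: process four single-precision elements per iteration. */
void add_sse(float *a, const float *b, const float *c, int len)
{
    int i;
    for (i = 0; i < len; i += 4) {
        __m128 vb = _mm_load_ps(&b[i]);            /* aligned 128-bit load */
        __m128 vc = _mm_load_ps(&c[i]);
        _mm_store_ps(&a[i], _mm_add_ps(vb, vc));   /* a[i..i+3] = b + c    */
    }
}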
The intrinsics map one-to-one with actual Streaming SIMD Extensions assembly code. The header files in which the intrinsics are defined are part of the Intel C++ Compiler included with the VTune Performance Enhancement Environment CD. Intrinsics are also defined for the MMX technology ISA. These are...
C++ classes, the performance of applications using this methodology can approach that of one using the intrinsics. Further details on the use of these classes can be found in the Intel C++ Class Libraries for SIMD Operations User’s Guide, order number 693500.
Again, the example assumes that the arrays passed to the routine are already aligned to a 16-byte boundary. Automatic Vectorization The Intel C++ Compiler provides an optimization mechanism by which loops, such as the one in Example 3-8, can be automatically vectorized, or converted into Streaming SIMD Extensions code. The compiler uses similar techniques to those used by a programmer to identify whether a loop is suitable for conversion to SIMD.
(See documentation for the Intel C++ Compiler.) The restrict keyword avoids the associated overhead altogether. Refer to the Intel® C++ Compiler User’s Guide for details.
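A minimal sketch of how restrict-qualified pointers help the vectorizer is shown below; the function and array names are illustrative, and the compiler option that enables the keyword varies by compiler and is an assumption here.

/* With restrict, the compiler may assume a, b and c do not alias, so it
   can vectorize the loop without emitting runtime overlap checks. */
void scale_add(float * restrict a, const float * restrict b,
               const float * restrict c, int len)
{
    int i;
    for (i = 0; i < len; i++)
        a[i] = 2.0f * b[i] + c[i];
}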
Stack and Data Alignment To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described in this section. Assembly code with unaligned accesses is much slower than code with aligned accesses. Alignment and Contiguity of Data Access Patterns The 64-bit packed data types defined by MMX technology, and the 128-bit packed data types for Streaming SIMD Extensions and...
If the first element is aligned to 8 bytes (64 bits) by adding the padding variable, all following elements will also be aligned. The sample declaration follows: typedef struct { short x,y,z; char a; char pad; } Point; Point pt[N]; Using Arrays to Make Data Contiguous In the following code, for (i=0;...
The IA-32 calling conventions, as implemented in most compilers, do not provide any mechanism for ensuring that certain local data and certain parameters are 16-byte aligned. Therefore, Intel has defined a new set of IA-32 software conventions for alignment to support the new __m128...
“holes” (due to padding) in the argument block. These new conventions, presented in this section as implemented by the Intel C++ Compiler, can be used as a guideline for assembly language code as well. In many cases, this section assumes the use of the __m128 data type, as defined by the Intel C++ Compiler, which represents an array of four 32-bit floats.
8-byte alignment. The following discussion and examples describe alignment techniques for Pentium 4 processor as implemented with the Intel C++ Compiler. Compiler-Supported Alignment The Intel C++ Compiler provides the following methods to ensure that the data is aligned. Alignment by F32vec4...
__declspec(align(16)) declarations to force 16-byte alignment. This is particularly useful for local or global data declarations that are assigned to 128-bit data types. The syntax for it is __declspec(align(integer-constant)) where the integer-constant is a power of two no greater than 32. For example, the following increases the alignment to 16 bytes: __declspec(align(16)) float buffer[400];
128-bit data. The default behavior is to use to align routines with 8- or 16-byte data types to 16-bytes. For more details, see relevant Intel application notes in the Intel Architecture Performance Training Center provided with the SDK and the Intel® C++ Compiler User’s Guide.
Improving Memory Utilization Memory performance can be improved by rearranging data and algorithms for SSE2, SSE, and MMX technology intrinsics. The methods for improving memory performance involve working with the following: • Data structure layout • Strip-mining for vectorization and memory utilization •...
SoA Data Structure Example 3-15 typedef struct{ float x[NumOfVertices]; float y[NumOfVertices]; float z[NumOfVertices]; int a[NumOfVertices]; int b[NumOfVertices]; int c[NumOfVertices]; . . . } VerticesList; VerticesList Vertices; There are two options for computing data in AoS format: perform operation on the data as it stands in AoS format, or re-arrange it (swizzle it) into SoA format dynamically.
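For contrast, the AoS layout referred to above (the counterpart of the SoA declaration in Example 3-15) typically looks like the following sketch; the field names mirror the SoA version and are assumptions rather than the manual's exact declaration.

typedef struct {
    float x, y, z;     /* one vertex's components kept together */
    int   a, b, c;
    /* . . . */
} Vertex;

Vertex Vertices[NumOfVertices];   /* array of structures (AoS) */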
Example 3-16 AoS and SoA Code Samples (continued) addps xmm1, xmm0 movaps xmm2, xmm1 shufps xmm2, xmm2,55h addps xmm2, xmm1 ; SoA code ; X = x0,x1,x2,x3 ; Y = y0,y1,y2,y3 ; Z = z0,z1,z2,z3 ; A = xF,xF,xF,xF ; B = yF,yF,yF,yF ;...
but is somewhat inefficient as there is the overhead of extra instructions during computation. Performing the swizzle statically, when the data structures are being laid out, is best as there is no runtime overhead. As mentioned earlier, the SoA arrangement allows more efficient use of the parallelism of the SIMD technologies because the data is ready for computation in a more optimal vertical manner: multiplying components...
Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses the arrays in Example 3-15 would require three separate data streams. This can require the use of more prefetches, additional address generation calculations, as well as having a greater impact on DRAM page access efficiency.
Strip Mining Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine.
Example 3-18 Pseudo-code Before Strip Mining (continued)
for (i=0; i<Num; i++) {
  Lighting(v[i]);
}
The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of array v[Num] is larger than the cache, the coordinates for v[i] that were cached during Transform(v[i]) will be evicted from the cache by the time Lighting(v[i]) is performed.
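A minimal sketch of the strip-mined form of this loop (strip_size is a tuning parameter chosen so one strip of v[] fits in cache; Transform and Lighting follow the pseudo-code above):

/* Strip-mined version: each strip of strip_size vertices is transformed
   and then lit while it is still resident in the cache. */
for (i = 0; i < Num; i += strip_size) {
    int n = (i + strip_size < Num) ? strip_size : (Num - i);
    for (j = i; j < i + n; j++)
        Transform(v[j]);
    for (j = i; j < i + n; j++)
        Lighting(v[j]);
}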
In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen so that strip_size elements of array v[i] fit in the cache hierarchy; this way, a given element v[i] brought in by the transformation will still be in the cache when we perform Lighting(v[i]), which improves performance over the non-strip-mined code. Loop Blocking Loop blocking is another useful technique for memory performance optimization.
This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, the loop is blocked by a blocking factor. Suppose that one row of the array occupies eight cache lines (32 bytes each). In the first iteration of the inner loop, A[0, 0:7] will be completely consumed, as will B[0, 0:7]...
As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If the blocking size is chosen appropriately, the technique can also help reduce the penalty from DTLB (data translation look-aside buffer) misses. In addition to improving the cache/memory performance, this optimization technique also saves external bus bandwidth.
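A minimal sketch of two-dimensional loop blocking (block_size is chosen so the sub-blocks of A and B touched by the inner loops fit in cache; the array names follow the discussion above and the loop body is illustrative):

/* Blocked version: the i/j iteration space is tiled so each tile's
   working set of A and B stays cache-resident across the inner loops. */
for (ii = 0; ii < MAX; ii += block_size) {
    for (jj = 0; jj < MAX; jj += block_size) {
        for (i = ii; i < ii + block_size; i++) {
            for (j = jj; j < jj + block_size; j++) {
                A[i][j] = A[i][j] + B[j][i];   /* transposed access to B */
            }
        }
    }
}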
However, the consumers should not be scheduled near the producer. SIMD Optimizations and Microarchitectures Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst microarchitecture. The following sub-section discusses optimizing SIMD code targeting Intel Core Solo and Intel Core Duo processors.
Using the VTune analyzer can help you with various phases required for optimized performance. See “Intel® VTune™ Performance Analyzer” in Appendix A for more details on how to use the VTune analyzer. After every effort to optimize, you should check the performance gains to see where you are making your major optimization gains.
The SIMD integer instructions provide performance improvements in applications that are integer-intensive and can take advantage of the SIMD architecture of Pentium 4, Intel Xeon, and Pentium M processors. The guidelines for using these instructions in addition to the guidelines...
SIMD data in the XMM register is strongly discouraged. • Use the optimization rules and guidelines described in Chapter 2 and Chapter 3 that apply to the Pentium 4, Intel Xeon and Pentium M processors. • Take advantage of hardware prefetcher where possible. Use prefetch instruction only when data access patterns are irregular and prefetch distance can be pre-determined.
Using SIMD Integer with x87 Floating-point All 64-bit SIMD integer instructions use the MMX registers, which share register state with the x87 floating-point stack. Because of this sharing, certain rules and considerations apply. Instructions which use the MMX registers cannot be freely intermixed with x87 floating-point registers.
Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready for new x87 floating-point operations. The emms instruction must be used when switching between using operations on the MMX registers and using operations on the x87 floating-point stack. On the Pentium 4 processor, there is a finite overhead for using the emms instruction. Failure to use the emms instruction between operations on the MMX registers and operations on the x87...
__m64 x = _m_paddd(y, z); float f = init(); Further, you must be aware of when your code generates an MMX instruction (which uses the MMX registers) with the Intel C++ Compiler, for example in the following situations: • when using a 64-bit SIMD integer intrinsic from MMX technology, SSE, or SSE2 • ... A minimal sketch of pairing such an intrinsic with _mm_empty is shown below.
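The following sketch pairs a 64-bit SIMD integer intrinsic with _mm_empty (which issues emms); the function and variable names are illustrative.

#include <mmintrin.h>

/* The MMX work writes through dst; _mm_empty() clears the x87 tag word
   before any x87/scalar floating-point code runs. */
void mmx_then_float(__m64 *dst, __m64 y, __m64 z, float *f)
{
    *dst = _m_paddd(y, z);   /* 64-bit SIMD integer add, uses MMX registers */
    _mm_empty();             /* emms: empty MMX state before x87 usage      */
    *f = *f * 2.0f;          /* floating-point code is now safe             */
}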
Data Alignment Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can incur a performance penalty due to accesses that span 2 cache lines. Referencing unaligned 128-bit SIMD integer data will result in an exception unless the movdqu (move double-quadword unaligned) instruction is used.
Example 4-2 Unsigned Unpack Instructions ; Input: MM0 source value, MM7 0 ; Output: movq MM1, MM0 punpcklwd MM0, MM7 punpckhwd MM1, MM7 Signed Unpack Signed numbers should be sign-extended when unpacking the values. This is similar to the zero-extend shown above except that the psrad instruction (packed shift right arithmetic) is used to effectively sign extend the values.
Example 4-3 Signed Unpack Code ; Input: ; Output: movq MM1, MM0 punpcklwd MM0, MM0 punpckhwd MM1, MM1 psrad MM0, 16 source psrad MM1, 16 Interleaved Pack with Saturation The pack instructions pack two values into the destination register in a predetermined order.
Figure 4-1 PACKSSDW mm, mm/mm64 Instruction Example Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that uses the operation. The two signed doublewords are used as source operands and the result is interleaved signed words.
16-bit values of the two sources into eight saturated eight-bit unsigned values in the destination. A complete specification of the MMX instruction set can be found in the Intel Architecture MMX Technology Programmer’s Reference Manual, order number 243007.
Example 4-5 Interleaved Pack without Saturation ; Input: ; Output: pslld MM1, 16 pand MM0, {0,ffff,0,ffff} por MM0, MM1 Non-Interleaved Unpack The unpack instructions perform an interleave merge of the data elements of the destination and source operands into the destination register.
Figure 4-3 Result of Non-Interleaved Unpack Low in MM0 The other destination register will contain the opposite combination illustrated in Figure 4-4. Figure 4-4 Result of Non-Interleaved Unpack High in MM1 Code in the Example 4-6 unpacks two packed-word sources in a non-interleaved way.
Example 4-6 Unpacking Two Packed-word Sources in a Non-interleaved Way ; Input: ; Output: movq MM2, MM0 punpckldq MM0, MM1 punpckhdq MM2, MM1 Extract Word The pextrw instruction takes the word in the designated MMX register selected by the two least significant bits of the immediate value and moves it to the lower half of a 32-bit integer register; see Figure 4-5 and Example 4-7.
Figure 4-5 pextrw Instruction Example 4-7 pextrw Instruction Code ; Input: ; Output: movq mm0, [eax] pextrw edx, mm0, 0 Insert Word The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory and inserts it in the MMX technology destination register at a position defined by the two least significant bits of the immediate constant.
Figure 4-6 pinsrw Instruction Example 4-8 pinsrw Instruction Code ; Input: ; Output: mov eax, [edx] pinsrw mm0, eax, 1 If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be useful to clear the content and break the dependence chain by either using the pxor instruction or loading the register.
Packed Shuffle Word for 64-bit Registers The pshufw instruction (see Figure 4-8, Example 4-11) uses the immediate (imm8) operand to select between the four words in either two MMX registers or one MMX register and a 64-bit memory location. Bits 1 and 0 of the immediate value encode the source for destination word 0 in the MMX register (Bits 1 - 0...
Example 4-11 pshuf Instruction Code ; Input: ; Output: movq mm0, [edi] pshufw mm1, mm0, 0x1b Packed Shuffle Word for 128-bit Registers The pshuflw/pshufhw instructions perform a shuffle of any source word field within the low/high 64 bits to any result word field in the low/high 64 bits, using an 8-bit immediate operand; the other high/low 64 bits are passed through from the source operand.
Data Movement There are two additional instructions to enable data movement from the 64-bit SIMD integer registers to the 128-bit SIMD registers. The movq2dq instruction moves the 64-bit integer data from an MMX register (source) to a 128-bit destination register. The high-order 64 bits of the destination register are zeroed-out.
Example 4-15 Generating Constants (continued) pxor MM0, MM0 pcmpeq MM1, MM1 psubb MM0, MM1 [psubw ; three instructions above generate ; the constant 1 in every ; packed-byte [or packed-word] ; (or packed-dword) field pcmpeq MM1, MM1 psrlw MM1, 16-n(psrld ;...
Building Blocks This section describes instructions and algorithms which implement common code building blocks efficiently. Absolute Difference of Unsigned Numbers Example 4-16 computes the absolute difference of two unsigned numbers. It assumes an unsigned packed-byte data type. Here, we make use of the subtract instruction with unsigned saturation.
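A minimal intrinsics sketch of this technique (subtract with unsigned saturation in both directions, then OR the two one-sided differences); this is an illustration rather than the manual's Example 4-16.

#include <mmintrin.h>

/* |a - b| per unsigned byte: psubusb clamps negative differences to zero,
   so exactly one of the two one-sided differences is non-zero per lane. */
__m64 abs_diff_u8(__m64 a, __m64 b)
{
    __m64 d1 = _mm_subs_pu8(a, b);   /* max(a - b, 0) in each byte */
    __m64 d2 = _mm_subs_pu8(b, a);   /* max(b - a, 0) in each byte */
    return _mm_or_si64(d1, d2);
}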
Absolute Difference of Signed Numbers Example 4-17 computes the absolute difference of two signed numbers. The technique used here is to first sort the corresponding elements of the input operands into packed words of the maximum values, and packed words of the minimum values. Then the minimum values are subtracted from the maximum values to generate the required absolute difference.
Example 4-17 Absolute Difference of Signed Numbers (continued) movq MM2, MM0 pcmpgtw MM0, MM1 movq MM4, MM2 pxor MM2, MM1 pand MM2, MM0 pxor MM4, MM2 pxor MM1, MM2 psubw MM1, MM4 Absolute Value Use Example 4-18 to compute the absolute value of a number; the example assumes signed words to be the operands.
Clipping to an Arbitrary Range [high, low] This section explains how to clip a value to the range [high, low]: if the value is greater than high it is clamped to high, and if it is less than low it is clamped to low. The technique uses packed-add and packed-subtract instructions with saturation (signed or unsigned), which means that it can only be used on packed-byte and packed-word data types.
Highly Efficient Clipping For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions may be used. For clipping unsigned bytes to an arbitrary range, the pmaxub and pminub instructions may be used. Example 4-19 shows how to clip signed words to an arbitrary range; the code for clipping unsigned bytes is similar. Example 4-19 Clipping to a Signed Range of Words [high, low] ;...
The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last instruction converts the data back to signed data and places the data within the signed range. Conversion to unsigned data is required for correct results when ( 0x8000 If ( high...
packed-subtract instructions with unsigned saturation, thus this technique can only be used on packed-bytes and packed-words data types. The example illustrates the operation on word values. Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low] ; Input: unsigned source operands ;...
Unsigned Byte The pmaxub instruction returns the maximum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory location. The pminub instruction returns the minimum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory location.
Figure 4-9 PSADBW Instruction Example The subtraction operation presented above is an absolute difference; that is, t = abs(x-y). The absolute differences are summed together, and the result is written into the lower word of the destination register. Packed Average (Byte/Word) The pavgb and pavgw instructions add the unsigned data elements of the source operand to the unsigned data elements of the destination register, along with a carry-in.
The pavgb instruction operates on packed unsigned bytes and the pavgw instruction operates on packed unsigned words. Complex Multiply by a Constant Complex multiplication is an operation which requires four multiplications and two additions. This is exactly how the pmaddwd instruction operates. In order to use this instruction, you need to format the data into multiple 16-bit values.
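A minimal sketch of the data formatting and the pmaddwd step for multiplying one complex value (a + ib) by a constant (c + id); this is an illustration rather than the manual's worked example, and the helper name is an assumption.

#include <mmintrin.h>

/* Input words (low to high): [a, b, a, b]; constant words: [c, -d, d, c].
   pmaddwd then produces two 32-bit results: ac - bd (real) and ad + bc
   (imaginary). */
__m64 cmul_by_const(short a, short b, short c, short d)
{
    __m64 in = _mm_set_pi16(b, a, b, a);            /* arguments are high..low */
    __m64 k  = _mm_set_pi16(c, d, (short)(-d), c);
    return _mm_madd_pi16(in, k);                    /* pmaddwd                 */
}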
Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input). Packed 32*32 Multiply The pmuludq instruction performs an unsigned multiply on the lower pair of double-word operands within each 64-bit chunk from the two sources;...
Memory Optimizations You can improve memory accesses using the following techniques: • Avoiding partial memory accesses • Increasing the bandwidth of memory fills and video fills • Prefetching data with Streaming SIMD Extensions (see Chapter 6, “Optimizing Cache Usage”). The MMX registers and XMM registers allow you to move large quantities of data without stalling the processor.
Partial Memory Accesses Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address mem). The large load will stall in this case, as shown in Example 4-24. Example 4-24 A Large Load after a Series of Small Stores (Penalty) mov mem, eax mov mem + 4, ebx movq...
Let us now consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address mem), as shown in Example 4-26. Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 for more details.
Optimizing for SIMD Integer Applications These transformations, in general, increase the number of instructions required to perform the desired operation. For Pentium II, Pentium III, and Pentium 4 processors, the benefit of avoiding forwarding problems outweighs the performance penalty due to the increased number of instructions, making the transformations worthwhile.
SSE3 provides an instruction LDDQU for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested.
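A minimal sketch of using lddqu through the _mm_lddqu_si128 intrinsic for a possibly unaligned source (the loop and names are illustrative, and len is assumed to be a multiple of 16):

#include <pmmintrin.h>   /* SSE3 intrinsics */

/* Accumulate 16-byte chunks from an unaligned source without incurring
   cache-line-split penalties on processors that implement lddqu as a
   wider aligned fetch. */
__m128i sum_chunks(const unsigned char *src, int len)
{
    __m128i acc = _mm_setzero_si128();
    int i;
    for (i = 0; i + 16 <= len; i += 16) {
        __m128i v = _mm_lddqu_si128((const __m128i *)(src + i));
        acc = _mm_add_epi8(acc, v);
    }
    return acc;
}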
(video fills). These recommendations are relevant for all Intel architecture processors with MMX technology and refer to cases in which the loads and stores do not hit in the first- or second-level cache.
same DRAM page have shorter latencies than sequential accesses to different DRAM pages. In many systems the latency for a page miss (that is, an access to a different page instead of the page previously accessed) can be twice as large as the latency of a memory page hit (access to the same page as the previous access).
— the code sequence is rewritten to use the psrldq/pslldq instructions (shift double quad-word operand by bytes). SIMD Optimizations and Microarchitectures Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst microarchitecture. The following sections discuss optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.
The net effect of using 128-bit SIMD integer instructions on Intel Core Solo and Intel Core Duo processors is likely to be slightly positive overall, but there may be a few situations where their use will generate an unfavorable performance impact.
Optimizing for SIMD Floating-point Applications This chapter discusses general rules for optimizing for the single-instruction, multiple-data (SIMD) floating-point instructions available in Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and Streaming SIMD Extensions 3 (SSE3). This chapter also provides examples that illustrate the optimization techniques for single-precision and double-precision SIMD floating-point applications.
• Use MMX technology instructions and registers for copying data that is not used later in SIMD floating-point computations. • Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield reduced accuracy but execute much faster. Note the following: —...
SIMD floating-point code uses a flat register model, whereas x87 floating-point code uses a stack model. Using scalar floating-point code eliminates the need to use fxch instructions, which have some performance limit on the Intel Pentium 4 processor. • Mixing with MMX technology code without penalty.
When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form. However, all of the optimizations regarding alignment, scheduling, instruction selection, and other optimizations covered in Chapter 2 and Chapter 3 should be observed.
For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array of structures (AoS). To fully utilize the SIMD registers in such applications, a new data layout has been proposed—a structure of arrays (SoA) resulting in more optimized performance.
simultaneously referred to as an diagram below) are computed in parallel, and the array is updated one vertex at a time. When data structures are organized for the horizontal computation model, sometimes the availability of homogeneous arithmetic operations in SSE and SSE2 may cause inefficiency or require additional intermediate movement between data elements.
To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on each component of 4 separate vertices, that is, processing multiple vectors simultaneously. This can also be referred to as an SoA form of representing vertices data shown in Table 5-1. Table 5-1 SoA Form of Representing Vertices Data Vx array...
Figure 5-2 Dot Product Operation Figure 5-2 shows how 1 result would be computed for 7 instructions if the data were organized as AoS and using SSE alone: 4 results would require 28 instructions. Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation mulps movaps shufps...
Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how 4 results are computed for 5 instructions. Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation mulps ; x*x' for all 4 x-components of 4 vertices mulps ;...
To gather data from 4 different memory locations on the fly, follow these steps: Identify the first half of the 128-bit memory location. Group the different halves together using the movlps and movhps instructions to form an xyxy layout in two registers. From the 4 attached halves, get the xxxx by using shuffles. The yyyy is derived the same way but only requires one shuffle.
Example 5-4 shows the same data-swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for SSE. Example 5-4 Swizzling Data Using Intrinsics //Intrinsics version of data swizzle void swizzle_intrin (Vertex_aos *in, Vertex_soa *out, int stride) __m128 x, y, z, w;...
CAUTION. Avoid creating a dependence chain on previous computations because the movhps/movlps instructions bypass one part of the register. The same issue can occur with the use of an exclusive-OR function within an inner loop in order to clear a register: xorps xmm0, xmm0 Although the generated result of all zeros does not depend on the specific data contained in the source operand (that is, XOR of a register with itself always produces all zeros), the instruction cannot execute until the instruction that generates...
Data Deswizzling In the deswizzle operation, we want to arrange the SoA format back into AoS format so the xxxx, yyyy, zzzz arrangement is stored in memory as xyz. To do this we can use the unpcklps/unpckhps instructions to regenerate the xyxy layout and then store each half (xy) into its corresponding memory location using movlps/movhps, followed by another movlps/movhps to store the remaining component. Example 5-5 illustrates the deswizzle function: Example 5-5 Deswizzling Single-Precision SIMD Data void deswizzle_asm(Vertex_soa *in, Vertex_aos *out)
Example 5-5 Deswizzling Single-Precision SIMD Data (continued) unpcklps xmm5, xmm4 unpckhps xmm0, xmm4 movlps [edx+8], xmm5 movhps [edx+24], xmm5 movlps [edx+40], xmm0 movhps [edx+56], xmm0 // DESWIZZLING ENDS HERE You may have to swizzle data in the registers, but not in memory. This occurs when two different functions need to process the data in different layout.
Example 5-8 illustrates how to use MMX technology code for copying or shuffling. Example 5-8 Using MMX Technology Code for Copying or Shuffling (a sequence of movq, punpckhdq and punpckldq instructions). Horizontal ADD Using SSE Although vertical computations generally use the SIMD performance better than horizontal computations do, in some cases the code must use a horizontal operation.
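One common way to form a horizontal sum of the four single-precision elements of an XMM register with SSE shuffles and adds is sketched below; this is an illustration, not necessarily the sequence used in the manual's horizontal-add example.

#include <xmmintrin.h>

/* Returns v0 + v1 + v2 + v3 for the packed floats in v. */
float hsum_ps(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);            /* [v2, v3, v2, v3]      */
    __m128 sum2 = _mm_add_ps(v, hi);              /* [v0+v2, v1+v3, ...]   */
    __m128 swap = _mm_shuffle_ps(sum2, sum2, 1);  /* element 1 into slot 0 */
    __m128 sum  = _mm_add_ss(sum2, swap);         /* (v0+v2) + (v1+v3)     */
    float r;
    _mm_store_ss(&r, sum);
    return r;
}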
avoided since there is a penalty associated with writing this register; typically, through the use of the cvttps2pi and cvttss2si instructions, the rounding control in MXCSR can always be set to round-nearest. Flush-to-Zero and Denormals-are-Zero Modes The flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes are not compatible with IEEE Standard 754. They are provided to improve performance for applications where underflow is common and where the generation of a denormalized result is not necessary.
Figure 5-4 Asymmetric Arithmetic Operation of the SSE3 Instruction Figure 5-5 Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD SSE3 and Complex Arithmetics The flexibility of SSE3 in dealing with AOS-type of data structure can be demonstrated by the example of multiplication and division of complex numbers.
instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using SSE3 instructions to perform division of complex numbers. Example 5-11 Multiplication of Two Pair of Single-precision Complex Number // Multiplication of // a + i b can be stored as a data structure movsldup xmm0, Src1;...
Example 5-12 Division of Two Pair of Single-precision Complex Number // Division of (ak + i bk ) / (ck + i dk ) movshdup xmm0, Src1; load imaginary parts into the movaps xmm1, src2; load the 2nd pair of complex values, mulps xmm0, xmm1;...
SSE3 and Horizontal Computation Sometimes the AoS type of data organization is more natural in many algebraic formulas. SSE3 enhances the flexibility of SIMD programming for applications that rely on the horizontal computation model. SSE3 offers several instructions that are capable of horizontal arithmetic operations.
SIMD Optimizations and Microarchitectures Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst microarchitecture. The following sub-section discusses optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.
Page 290
Packed horizontal SSE3 instructions (haddps and hsubps) can simplify the code sequence for some tasks. However, these instructions consist of more than five micro-ops on Intel Core Solo and Intel Core Duo processors. Care must be taken to ensure the latency and decoding penalty of the horizontal instruction does not offset any algorithmic benefits.
Optimizing Cache Usage Over the past decade, processor speed has increased more than ten times. Memory access speed has increased at a slower pace. The resulting disparity has made it important to tune applications in one of two ways: either (a) a majority of the data accesses are fulfilled from processor caches, or (b) effectively masking memory latency to utilize peak memory bandwidth as much as possible.
Examples of such Intel compiler intrinsics include _mm_prefetch. For details on these intrinsics, refer to the Intel® C++ Compiler User’s Guide, doc. number 718195. NOTE. In a number of cases presented in this chapter,...
• Facilitate compiler optimization: — Minimize use of global variables and pointers — Minimize use of complex control flow — Use the const modifier, avoid register — Choose data types carefully (see below) and avoid type casting. • Use cache blocking techniques (for example, strip mining): —...
Hardware Prefetching of Data The Pentium 4, Intel Xeon, Pentium M, Intel Core Solo and Intel Core Duo processors implement a hardware automatic data prefetcher which monitors application data access patterns and prefetches data automatically.
Data Reads for load streams. Other than the items 2 and 4 discussed above, most other characteristics also apply to Pentium M, Intel Core Solo and Intel Core Duo processors. The hardware prefetcher implemented in the Pentium M processor fetches data to a second level cache.
Data reference patterns can be classified as follows: temporal, spatial, and non-temporal. These data characteristics are used in the discussions that follow. Prefetch This section discusses the mechanics of the software prefetch instructions. In general, software prefetch instructions should be used to supplement the practice of tuning an access pattern to suit the automatic hardware prefetch mechanism.
The behavior of the prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize performance. NOTE. Using the prefetch instructions is recommended only if data does not fit in cache. The prefetch instructions merely provide a hint to the hardware, and they will not generate exceptions or faults except for a few special cases (see the “Prefetch and Load Instructions”...
The Prefetch Instructions – Pentium 4 Processor Implementation Streaming SIMD Extensions include four flavors of prefetch instructions, one non-temporal and three temporal. They correspond to two types of operations, temporal and non-temporal. The non-temporal instruction is prefetchnta. The temporal instructions are prefetcht0, prefetcht1 and prefetcht2.
Currently, the prefetch instruction provides a greater performance gain than preloading because it: • has no destination register, it only updates cache lines. • does not stall the normal instruction retirement. • does not affect the functional behavior of the program. • has no cache line split accesses. •...
The Non-temporal Store Instructions This section describes the behavior of streaming stores and reiterates some of the information presented in the previous section. In Streaming SIMD Extensions, the movntps, movntq and maskmovq instructions are streaming stores. With regard to memory characteristics and ordering, they are similar mostly to the Write-Combining (WC) memory type: •
(with semantics). Note that the approaches (separate or combined) can be different for future processors. The Pentium 4, Intel Core Solo and Intel Core Duo processors implement the latter policy (of Optimizing Cache Usage ) or memory type range registers...
evicting data from all processor caches). The Pentium M processor implements a combination of both approaches. If the streaming store hits a line that is present in the first-level cache, the store data is combined in place within the first-level cache.
Optimizing Cache Usage possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility. Streaming Store Usage Models The two primary usage domains for streaming store are coherent requests and non-coherent requests. Coherent Requests Coherent requests are normal loads and stores to system memory, which may also hit cache lines present in another processor in a...
In case the region is not mapped as in-place in the cache and a subsequent data being written to system memory. Explicitly mapping the region as in this case ensures that any data read from this region will not be placed in the processor’s caches.
The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register) instructions store data from a register to the location specified by the edi register. The most significant bit in each byte of the second mask register is used to selectively write the data of the first register on a per-byte basis.
The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a result, the sfence instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume this data. The sfence instruction provides a performance-efficient way of ensuring the ordering when every...
The clflush Instruction The cache line associated with the linear address specified by the byte-address operand is invalidated from all levels of the processor cache hierarchy (data and instruction). The invalidation is broadcast throughout the coherence domain. If, at any level of the cache hierarchy, the line is inconsistent with memory (dirty), it is written to memory before invalidation.
Example 6-1 Pseudo-code for Using clflush while (!buffer_ready) {} mfence for(i=0;i<num_cachelines;i+=cacheline_size) { clflush (char *)((unsigned int)buffer + i) } mfence prefetchnta buffer[0]; VAR = buffer[0]; Memory Optimization Using Prefetch The Pentium 4 processor has two mechanisms for data prefetch: software-controlled prefetch and an automatic hardware prefetch. Software-controlled Prefetch The software-controlled prefetch is enabled using the four prefetch instructions introduced with Streaming SIMD Extensions.
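A minimal sketch of software-controlled prefetch using the _mm_prefetch intrinsic inside a processing loop; the prefetch distance and the loop body are assumptions to be tuned according to the scheduling-distance discussion later in this chapter (n is assumed to be a multiple of four).

#include <xmmintrin.h>

#define PREFETCH_AHEAD 128   /* bytes; tune per the prefetch scheduling distance */

/* prefetchnta hints the hardware to bring future lines close to the
   processor without polluting higher cache levels. */
void scale_in_place(float *data, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        _mm_prefetch((const char *)&data[i] + PREFETCH_AHEAD, _MM_HINT_NTA);
        data[i]     *= 2.0f;
        data[i + 1] *= 2.0f;
        data[i + 2] *= 2.0f;
        data[i + 3] *= 2.0f;
    }
}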
Hardware Prefetch The automatic hardware prefetch, can bring cache lines into the unified last-level cache based on prior data misses. The automatic hardware prefetcher will attempt to prefetch two cache lines ahead of the prefetch stream. This feature is introduced with the Pentium 4 processor. The characteristics of the hardware prefetching are as follows: •...
• May consume extra system bandwidth if the application’s memory traffic has significant portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-stride memory traffic). • Effectiveness with existing applications depends on the proportions of small-stride versus large-stride accesses in the application’s memory traffic.
Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride register char ** p; *next; // Populating pArray for circular pointer char p = ( char **)*p; loads a value pointing to next load p = (char **)&pArray; for (i = 0;...
Figure 6-1 Effective Latency Reduction as a Function of Access Stride (chart: upper bound of pointer-chasing latency reduction, in percent, plotted against the stride between loads).
execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture. Figure 6-2 Memory Access Latency and Execution Without Prefetch Execution Execution units idle pipeline...
The performance loss caused by poor utilization of resources can be completely eliminated by correctly scheduling the prefetch instructions appropriately. As shown in Figure 6-3, prefetch instructions are issued two vertex iterations ahead. This assumes that only one vertex gets processed in one iteration and a new data cache line is needed for each iteration.
• Balance single-pass versus multi-pass execution • Resolve memory bank conflict issues • Resolve cache management issues The subsequent sections discuss all the above items. Software Prefetch Scheduling Distance Determining the ideal prefetch placement in the code depends on many architectural parameters, including the amount of memory to be prefetched, cache lookup latency, system memory latency, and estimate of computation cycle.
lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration. Example 6-3 Prefetch Scheduling Distance top_loop: prefetchnta [edx + esi + 128*3] prefetchnta [edx*4 + esi + 128*3] .
Optimizing Cache Usage This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effect can be removed by applying a technique called prefetch concatenation. With this technique, the memory access and execution can be fully pipelined and fully utilized.
Example 6-4 Using Prefetch Concatenation for (ii = 0; ii < 100; ii++) { for (jj = 0; jj < 32; jj+=8) { prefetch a[ii][jj+8] computation a[ii][jj] Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its associated outer loop. Simply by unrolling the last iteration out of the inner loop and specifying the effective prefetch address for data used in the following iteration, the performance loss of memory de-pipelining can be...
Minimize Number of Software Prefetches Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clocks and memory bandwidth. Excessive prefetching may lead to performance penalties because issue penalties in the front-end of the machine and/or resource contention in the memory sub-system.
Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop. The secondary Y axis indicates the percentage of bus bandwidth utilization.
Figure 6-5 Memory Access Latency and Execution With Prefetch (charts of execution time in clocks per loop and percent bus utilization versus computations per loop, for one load and one store stream and for two load streams and one store stream, at prefetch distances of 16, 32, 64, 128 and none).
Mix Software Prefetch with Computation Instructions It may seem convenient to cluster all of the prefetch instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation. In order to achieve best possible performance, prefetch instructions must be interspersed with other computational instructions in the instruction sequence rather than clustered together.
Software Prefetch and Cache Blocking Techniques Cache blocking techniques, such as strip-mining, are used to improve temporal locality, and thereby cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory. When two-dimensional arrays are used in programs, loop blocking technique (similar to strip-mining but in two dimensions) can be applied for a better memory performance.
Figure 6-6 Cache Blocking – Temporally Adjacent and Non-adjacent Passes (datasets A and B processed in temporally adjacent and in non-adjacent passes). In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation.
Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios. Figure 6-7 Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops. For Pentium 4 processors, the left scenario shows a graphical implementation of prefetching into selected ways of the second-level cache only (SM1 denotes strip mine one way of second-level cache), minimizing second-level cache pollution.
In the scenario to the right in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache locality. Therefore, use prefetcht0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps a copy of the data in second-level cache, which reduces memory traffic and latencies for passes 3 and 4.
Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during transformation loop as well as bandwidth wasted in the lighting loop. Now consider the code in Example 6-8 where strip-mining has been incorporated into the loops.
Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are: • Do strip-mining: partition loops so that the dataset fits into second-level cache. • Use prefetchnta if the dataset fits into 32K (one way of second-level cache). Use prefetcht0 if the dataset exceeds 32K. •...
happen to be powers of 2, aliasing condition due to finite number of way-associativity (see “Capacity Limits and Aliasing in Caches” in Chapter 2) will exacerbate the likelihood of cache evictions. Example 6-9 Using HW Prefetch to Improve Read-Once Memory Traffic Un-optimized image transpose // dest and src represent two-dimensional arrays for( i = 0;i <...
references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually reference the linear addresses. Single-pass versus Multi-pass Execution An algorithm can use single- or multi-pass execution defined as follows: • Single-pass, or unlayered execution passes a single data element through an entire computation pipeline.
selected to ensure that the batch stays within the processor caches through all passes. An intermediate cached buffer is used to pass the batch of vertices from one stage or pass to the next one. Single-pass execution can be better suited to applications which limit the number of features that may be used at a given time.
The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation in overall execution time. In contrast, for a single-pass approach, bandwidth-limitations can be distributed/ amortized across other computation-intensive stages.
In addition, the Pentium 4 processor takes advantage of the Intel C++ Compiler that supports C++ language-level features for the Streaming SIMD Extensions. The Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization.
Optimizing Cache Usage The following examples of using prefetching instructions in the operation of video encoder and decoder as well as in simple 8-byte memory copy, illustrate performance gain from using the prefetching instructions for efficient cache management. Video Encoder In a video encoder example, some of the data used during the encoding process is kept in the processor’s second-level cache, to minimize the number of reference streams that must be re-read from system memory.
Later, the processor re-reads the data using maximum bandwidth, yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch. Conclusions from Video Encoder and Decoder Implementation These two examples indicate that by using an appropriate combination of non-temporal prefetches and non-temporal stores, an application can be designed to lessen the overhead of memory transactions by preventing second-level cache pollution, keeping useful data in the...
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations: • alignment of data • proper layout of pages in memory • cache size • interaction of the transaction lookaside buffer (TLB) with memory accesses •...
Using the 8-byte Streaming Stores and Software Prefetch Example 6-11 presents the copy algorithm that uses second level cache. The algorithm performs the following steps: Uses blocking technique to transfer 8-byte data from memory into second-level cache using the a time to fill a block. The size of a block should be less than one half of the size of the second-level cache, but large enough to amortize the cost of the loop.
Example 6-11 A Memory Copy Routine Using Software Prefetch // copy 128 byte per loop for (j=kk; j<kk+NUMPERPAGE; j+=16) { _mm_stream_ps((float*)&b[j], _mm_load_ps((float*)&a[j])); _mm_stream_ps((float*)&b[j+2], _mm_load_ps((float*)&a[j+2])); _mm_stream_ps((float*)&b[j+4], _mm_load_ps((float*)&a[j+4])); _mm_stream_ps((float*)&b[j+6], _mm_load_ps((float*)&a[j+6])); _mm_stream_ps((float*)&b[j+8], _mm_load_ps((float*)&a[j+8])); _mm_stream_ps((float*)&b[j+10], _mm_load_ps((float*)&a[j+10])); _mm_stream_ps((float*)&b[j+12], _mm_load_ps((float*)&a[j+12])); _mm_stream_ps((float*)&b[j+14], _mm_load_ps((float*)&a[j+14])); // finished copying one block // finished copying N elements _mm_sfence();...
The instruction, table entry for array, and This is essentially a prefetch itself, as a cache line is filled from that memory location with this instruction. Hence, the prefetching starts from in this loop. kk+4 This example assumes that the destination of the copy is not temporally adjacent to the code.
If CPUID supports the function leaf with input EAX = 4, this is referred to as the deterministic cache parameter leaf of CPUID (see the CPUID instruction in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2A). Software can use the deterministic cache parameter leaf to...
query each level of the cache hierarchy. Enumeration of each cache level is done by specifying an index value (starting from 0) in the ECX register. The list of parameters is shown in Table 6-3. Table 6-3 Deterministic Cache Parameters Leaf: Bit Location EAX[4:0] (cache type field), EAX[7:5] (cache level)...
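A minimal C sketch of enumerating the deterministic cache parameter leaf (CPUID leaf 4) by stepping the ECX index until the cache-type field EAX[4:0] reads 0; it assumes a GCC-style compiler, and the size computation uses the architectural "plus one" encodings of the ways, partitions, line size and set count fields.

#include <stdio.h>

static void cpuid_count(unsigned leaf, unsigned subleaf, unsigned *a,
                        unsigned *b, unsigned *c, unsigned *d)
{
    __asm__ __volatile__("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf), "c"(subleaf));
}

int main(void)
{
    unsigned idx;
    for (idx = 0; ; idx++) {
        unsigned eax, ebx, ecx, edx;
        cpuid_count(4, idx, &eax, &ebx, &ecx, &edx);
        unsigned type = eax & 0x1F;                 /* EAX[4:0], 0 = no more caches */
        if (type == 0)
            break;
        unsigned level = (eax >> 5) & 0x7;          /* EAX[7:5]        */
        unsigned line  = (ebx & 0xFFF) + 1;         /* EBX[11:0] + 1   */
        unsigned parts = ((ebx >> 12) & 0x3FF) + 1; /* EBX[21:12] + 1  */
        unsigned ways  = ((ebx >> 22) & 0x3FF) + 1; /* EBX[31:22] + 1  */
        unsigned sets  = ecx + 1;                   /* ECX[31:0] + 1   */
        printf("L%u cache: %u bytes, %u-way, %u-byte lines\n",
               level, ways * parts * line * sets, ways, line);
    }
    return 0;
}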
• Determine multi-threading resource topology in an MP system (See Section 7.10 of IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A). • Determine cache hierarchy topology in a platform using multi-core processors (See Example 7-13). • Manage threads and processor affinities.
platform, software can extract information on the number and the identities of each logical processor sharing that cache level; this information is made available to the application by the OS. This is discussed in detail in “Using Shared Execution Resources in a Processor Core” in Chapter 7 and Example 7-13.
The number of logical processors present in each package can also be obtained from CPUID. The application must check how many logical processors are enabled and made available to application at runtime by making the appropriate operating system calls. See the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2A for more information.
cores, but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This chapter covers guidelines that apply to either situation. This chapter covers • Performance characteristics and usage models, • Programming models for multithreaded applications, •...
Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law. The bar in Figure 7-1 represents an individual task unit or the collective workload of an entire application. In general, the speed-up of running multiple threads on an MP system with N physical processors, over single-threaded execution, can be expressed as: RelativeResponse = Tsequential/Tparallel = 1 / ((1 - P) + P/N + O), where P is the fraction of the workload that can execute in parallel and O represents the overhead of synchronization and other multithreading costs.
When optimizing application performance in a multithreaded environment, control flow parallelism is likely to have the largest impact on performance scaling with respect to the number of physical processors and to the number of logical processors per physical processor. If the control flow of a multi-threaded application contains a workload in which only 50% can be executed in parallel, the maximum performance gain using two physical processors is only 33%, compared to using a single processor.
terms of time of completion relative to the same task when in a single-threaded environment) will vary, depending on how much shared execution resources and memory are utilized. For development purposes, several popular operating systems (for example, Microsoft Windows* XP Professional and Home, and Linux* distributions using kernel 2.4.19 or later) can manage the task scheduling and the balancing of shared execution resources within each physical processor to maximize the throughput.
When two applications are employed as part of a multi-tasking workload, there is little synchronization overhead between these two processes. It is also important to ensure each application has minimal synchronization overhead within itself. An application that uses lengthy spin loops for intra-process synchronization is less likely to benefit from Hyper-Threading Technology in a multi-tasking workload.
Parallel Programming Models Two common programming models for transforming independent task requirements into application threads are: • domain decomposition • functional decomposition Domain Decomposition Usually large compute-intensive tasks use data sets that can be divided into a number of small subsets, each having a large degree of computational independence.
IA-32 processor supporting Hyper-Threading Technology. Specialized Programming Models Intel Core Duo processor offers a second-level cache shared by two processor cores in the same physical package. This provides opportunities for two application threads to access some application data while minimizing the overhead of bus traffic.
overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with the number of cores, the synchronization overhead must be kept low. This can be done by ensuring the producer and consumer threads have comparable time constants for completing each incremental task prior to exchanging buffers.
The gap between each task represents synchronization overhead. The decimal number in the parenthesis represents a buffer index. On an Intel Core Duo processor, the producer thread can store data in the second-level cache to allow the consumer thread to continue work requiring minimal bus traffic.
Example 7-2 Basic Structure of Implementing Producer Consumer Threads (a) Basic structure of a producer thread function void producer_thread() int iter_num = workamount - 1; // make local copy int mode1 = 1; produce(buffs[0],count); // placeholder function while (iter_num--) { Signal(&signal1,1);...
corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in two threads. As long as the data generated by the producer reside in either the first or second level cache of the same core, the consumer can access them without incurring bus traffic.
Example 7-3 Thread Function for an Interlaced Producer Consumer Model // master thread starts the first iteration, the other thread must wait // one iteration void producer_consumer_thread(int master) int mode = 1 - master; // track which thread and its designated buffer index unsigned int iter_num = workamount >>...
(API) is not the only method for creating multithreaded applications. New tools such as the Intel C++ Compiler are now available with capabilities that make the challenge of creating multithreaded applications easier. Two features available in the latest Intel Compilers are: • generating multithreaded code using OpenMP* directives • generating multithreaded code with automatic parallelization...
Thread Profiler. Thread Profiler is a plug-in data collector for the Intel VTune Performance Analyzer. Use it to analyze threading performance and identify parallel performance bottlenecks. It graphically illustrates what each thread is doing at various levels of detail using a hierarchical summary.
Optimization Guidelines This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance): • thread synchronization • bus utilization • memory optimization • front end optimization • execution resource optimization Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow.
• Place each synchronization variable alone, separated by 128 bytes or in a separate cache line. See “Thread Synchronization” for more details. Key Practices of System Bus Optimization Managing bus traffic can significantly impact the overall performance of multithreaded software and MP systems. Key practices of system bus optimization for achieving high data throughput and quick response are: •...
• Adjust the private stack of each thread in an application so the spacing between these stacks is not offset by multiples of 64 KB or 1 MB (prevents unnecessary cache line evictions) when targeting IA-32 processors supporting Hyper-Threading Technology. •...
• For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated threads to increase the hardware resource utilization of each physical processor package. See “Using Thread Affinities to Manage Shared Platform Resources” for more details.
Generality and Performance Impact
The next five sections cover the optimization techniques in detail. Recommendations discussed in each section are ranked by importance in terms of estimated local impact and generality.
The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s requirements for synchronization. Intel Thread Profiler can be used to profile the execution timeline of each thread and detect situations where performance is impacted by frequent occurrences of synchronization overhead.
Profiler can be very useful in dealing with multi-threading functional correctness issues and performance impact under multi-threaded execution. Additional information on the capabilities of Intel Thread Checker and Thread Profiler is described in Appendix A. Table 7-1 is useful for comparing the properties of three categories of synchronization objects available to multi-threaded applications.
Table 7-1 Properties of Synchronization Objects (Contd.)
Operating System Synchronization Objects
Characteristics:
  Miscellaneous: Some objects provide intra-process synchronization and some are for inter-process communication.
  Recommended use conditions: 1. # of active threads > # of cores. 2. Waiting thousands of cycles for a signal.
This penalty occurs on the Pentium M processor, the Intel Core Solo and Intel Core Duo processors. However, the penalty on these processors is small compared with penalties suffered on the Pentium 4 and Intel Xeon processors.
Example 7-4 Spin-wait Loop and PAUSE Instructions
(a) An un-optimized spin-wait loop experiences a performance penalty when exiting the loop. It consumes execution resources without contributing computational work.
do {
    // This loop can run faster than the speed of memory access,
    // other worker threads cannot finish modifying sync_var until
    // outstanding loads from the spinning loops are resolved.
An optimized spin-wait loop using the PAUSE instruction is shown in Example 7-4(b). The PAUSE instruction is compatible with all IA-32 processors. On IA-32 processors prior to the Intel NetBurst microarchitecture, the PAUSE instruction is essentially a NOP instruction. Additional examples of optimizing spin-wait loops using the PAUSE instruction are available in Application Note AP-949 “Using...
To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acquire the same lock. Apply a software pipelining technique to handle data that must be shared between multiple threads. Instead of allowing multiple threads to compete for a given lock, no more than two threads should have write access to a given lock.
If an application thread must remain idle for a long time, the application should use a thread blocking API or other method to release the idle processor. The techniques discussed here apply to traditional MP systems, but they have an even higher impact on IA-32 processors that support Hyper-Threading Technology.
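The following is a minimal sketch of the blocking approach, assuming a Win32 environment; the event handle, worker routine, and the way work items are delivered are illustrative assumptions, not code from the examples in this chapter.

/* Release an idle processor with a blocking wait instead of spinning.
   The work_ready event is assumed to be created elsewhere with CreateEvent()
   and signaled by the producer with SetEvent(). */
#include <windows.h>

HANDLE work_ready;   /* assumed: created with CreateEvent() during startup */

DWORD WINAPI worker(LPVOID arg)
{
    for (;;) {
        /* The thread sleeps in the OS until the event is signaled; no
           execution resources are consumed while waiting. */
        WaitForSingleObject(work_ready, INFINITE);
        /* ... process the available work item ... */
    }
    return 0;
}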
Avoid Coding Pitfalls in Thread Synchronization
Synchronization between multiple threads must be designed and implemented with care to achieve good performance scaling with respect to the number of discrete processors and the number of logical processors per physical processor. No single technique is a universal solution for every synchronization situation.
Example 7-5 Coding Pitfall using Spin Wait Loop
(a) A spin-wait loop attempts to release the processor incorrectly. It experiences a performance penalty if the only worker thread and the control thread run on the same physical processor package.
// Only one worker thread is running,
// the control loop waits for the worker thread to complete.
Prevent Sharing of Modified Data and False-Sharing
On an Intel Core Duo processor, sharing of modified data incurs a performance penalty when a thread running on one core tries to read or write data that is currently present in modified state in the first level cache of the other core.
User/Source Coding Rule 24. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors).
• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used locally by one thread are allocated in a manner to prevent sharing the cache line with other threads. Example 7-6 Placement of Synchronization and Regular Variables regVar;...
• In managed environments that provide automatic object allocation, the object allocators and garbage collectors are responsible for layout of the objects in memory so that false sharing through two objects does not happen. • Provide classes such that only one thread writes to each object field and close object fields, in order to avoid false sharing.
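As an illustration of the padding guidance above, the sketch below keeps each thread's frequently written counter and a synchronization variable in their own 128-byte regions; the structure layout, names, and thread count are assumptions for illustration, not one of this chapter's numbered examples.

#define MAX_THREADS 4
#define SECTOR_SIZE 128   /* 128 bytes covers two 64-byte lines, i.e. one
                             sector on Pentium 4 and Intel Xeon processors */

/* Each thread owns one array element; the padding prevents two threads'
   counters from ever sharing a cache line or sector.  __declspec(align) is
   the Windows/Intel compiler spelling; other compilers use different syntax. */
typedef struct {
    volatile long counter;
    char          pad[SECTOR_SIZE - sizeof(long)];
} per_thread_counter;

static __declspec(align(128)) per_thread_counter counters[MAX_THREADS];
static __declspec(align(128)) volatile long      sync_var;  /* alone in its sector */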
Conserve Bus Bandwidth
In a multi-threading environment, bus bandwidth may be shared by memory traffic originating from multiple bus agents (these agents can be several logical processors and/or several processor cores). Preserving the bus bandwidth can improve processor scaling performance. Also, effective bus bandwidth typically will decrease if there are significant large-stride cache misses.
Be careful when parallelizing code sections with data sets that result in the total working set exceeding the second-level cache and/or consumed bandwidth exceeding the capacity of the bus. On an Intel Core Duo processor, if only one thread is using the second-level cache...
Avoid Excessive Software Prefetches
Pentium 4 and Intel Xeon processors have an automatic hardware prefetcher. It can bring data and instructions into the unified second-level cache based on prior reference patterns. In most situations, the hardware prefetcher is likely to reduce system memory latency without explicit intervention from software prefetches.
The latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to overlap multiple outstanding memory read transactions. The average latency of back-to-back bus reads is likely to be lower than the average latency of scattered reads interspersed with other bus transactions.
Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining technique to separate WC store operations from competing with WB store traffic. To implement software write-combining, uncacheable writes to memory with the WC attribute are written to a small, temporary buffer (WB type) that fits in the first level data cache.
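A minimal sketch of the software write-combining idea described above follows; the buffer size, destination pointer, and helper signature are illustrative assumptions rather than code from this manual.

#include <string.h>

#define WC_BUF_SIZE 64   /* one cache line; small enough to stay resident
                            in the first-level data cache (WB memory) */

/* wc_dst is assumed to point into WC-mapped memory (for example, a graphics
   aperture).  Partial writes are first combined in a WB-cached staging
   buffer; the full line is then copied out in one burst, so the WC stores
   are not interleaved with ordinary WB store traffic while being assembled. */
void wc_write(char *wc_dst, const char *piece, size_t off, size_t len)
{
    static char staging[WC_BUF_SIZE];     /* temporary WB-type buffer      */
    memcpy(&staging[off], piece, len);    /* combine the partial writes    */
    if (off + len == WC_BUF_SIZE)         /* buffer full: emit whole line  */
        memcpy(wc_dst, staging, WC_BUF_SIZE);
}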
The block size for loop blocking should be determined by dividing the target cache size by the number of logical processors available in a physical processor package. Typically, some cache lines are needed to access data that are not part of the source or destination buffers used in cache blocking, so the block size can be chosen between one quarter and one half of the target cache (see also, Chapter 3).
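For example, under this sizing rule a 512-KB shared second-level cache serving two logical processors might be handled as in the short sketch below; the cache size and the choice of one quarter are illustrative assumptions, not values from this chapter.

/* A sketch of the block-size rule above. */
static unsigned long pick_block_size(unsigned long cache_bytes,
                                     unsigned long logical_cpus)
{
    unsigned long share = cache_bytes / logical_cpus;  /* per-thread share of the cache */
    return share / 4;                                  /* use one quarter of that share */
}
/* pick_block_size(512 * 1024, 2) evaluates to 65536, i.e. a 64 KB block. */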
Figure 7-5, is to minimize bus traffic while sharing data between the producer and the consumer using a shared second-level cache. On an Intel Core Duo processor, when the work buffers are small enough to fit within the first-level cache, re-ordering of producer and consumer tasks is necessary to achieve optimal performance.
Example 7-8 shows the batched implementation of the producer and consumer thread functions.
Example 7-8 Batched Implementation of the Producer Consumer Threads
void producer_thread()
{
    int iter_num = workamount - batchsize;
    int mode1;
    for (mode1 = 0; mode1 < batchsize; mode1++) {
        produce(buffs[mode1], count);
    }
    while (iter_num--) {
        Signal(&signal1, 1);...
Pentium 4 processor performance monitoring events. Appendix B includes an updated list of Pentium 4 processor performance metrics. These metrics are based on events accessed using the Intel VTune performance analyzer. Performance penalties associated with 64 KB aliasing are applicable mainly to current processor implementations of Hyper-Threading Technology or Intel NetBurst microarchitecture.
Preventing Excessive Evictions in First-Level Data Cache
Cached data in a first-level data cache are indexed to linear addresses but physically tagged. Data in second-level and third-level caches are tagged and indexed to physical addresses. While two logical processors in the same physical processor package execute in separate linear address spaces, they can reference data at the same linear address in two address spaces but mapped to different physical addresses.
(when using IA-32 processors supporting Hyper-Threading Technology). For parallel applications written to run with OpenMP, the OpenMP runtime library in Intel KAP/Pro Toolset automatically provides the stack offset adjustment for each thread. Example 7-9 shows a code fragment...
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads
void Func_thread_entry(DWORD *pArg)
{
    DWORD StackOffset = *pArg;
    DWORD var1;   // The local variable at this scope may not benefit
    DWORD var2;   // from the adjustment of the stack pointer that ensues.
    // Call runtime library routine to offset stack pointer.
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads (Contd.)
{
    DWORD Stack_offset, ID_Thread1, ID_Thread2, ID_Thread3;
    Stack_offset = 1024;
    // Stack offset between parent thread and the first child thread.
    ID_Thread1 = CreateThread(Func_thread_entry, &Stack_offset);
    // Call OS thread API.
    Stack_offset = 2048;...
However, the buffer space does enable the first-level data cache to be shared cooperatively when two copies of the same application are executing on the two logical processors in a physical processor package. To establish a suitable stack offset for two instances of the same application running on two logical processors in the same physical processor package, the stack pointer can be adjusted in the entry function of the application using the technique shown in Example 7-10.
For dual-core processors where the second-level unified cache is shared by two processor cores (e.g. Intel Core Duo processor), multi-threaded software should consider the increase in code working set due to two threads fetching code from the unified cache as part of front-end and cache optimization.
On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache’s ability to deliver high bandwidth μop streams to the execution engine.
Optimization for Code Size
When the Trace Cache is continuously and repeatedly delivering μop traces that are pre-built, the scheduler in the execution engine can dispatch μops for execution at a high rate and maximize the utilization of available execution resources.
APIC_ID (See Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for more details) associated with a logical processor. The three levels are: • physical processor package. A PACKAGE_ID label can be used to distinguish different physical packages within a cluster.
Affinity masks can be used to optimize shared multi-threading resources.
Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical Processor
// The BIOS and/or OS may limit the number of logical processors
// available to applications after system boot.
// The below algorithm will compute topology for the logical processors
// visible to the thread that is computing it.
Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical Processor (Contd.) if (ThreadAffinityMask & SystemAffinity){ Set thread to run on the processor specified in ThreadAffinityMask. Wait if necessary and ensure thread is running on specified processor. apic_conf[ProcessorNum].initialAPIC_ID = GetInitialAPIC_ID(); Extract the Package, Core and SMT ID as explained in three level extraction algorithm.
first to the primary logical processor of each processor core. This example is also optimized to the situations of scheduling two memory-intensive threads to run on separate cores and scheduling two compute-intensive threads on separate cores.
User/Source Coding Rule 39.
Example 7-12 Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First AFFINITYMASK LuT[64]; // A Lookup table to retrieve the affinity // mask we want to use from the thread // scheduling sequence index. int index =0;...
Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache
// Logical processors sharing the same cache can be determined by bucketing
// the logical processors with a mask, the width of the mask is determined
// from the maximum number of logical processors sharing that cache level.
Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache (Contd.) PackageID[ProcessorNUM] = PACKAGE_ID; CoreID[ProcessorNum] = CORE_ID; SmtID[ProcessorNum] = SMT_ID; CacheID[ProcessorNUM] = CACHE_ID; // Only the target cache is stored in this example ProcessorNum++; ThreadAffinityMask <<= 1; NumStartedLPs = ProcessorNum;...
Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache (Contd.) For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) { ProcessorMask << = 1; For (i = 0; i < CacheNum; i++) { // We may be comparing bit-fields of logical processors // residing in a different modular boundary of the cache // topology, the code below assume symmetry across this // modular boundary.
Processor topology and an algorithm for software to identify the processor topology are discussed in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A. Typically the bus system is shared by multiple agents at the SMT level and at the processor core level of the processor topology.
Such performance metrics are described in Appendix B and can be accessed using the Intel VTune Performance Analyzer. An event ratio like non-halted cycles per instructions retired (non-halted CPI) and non-sleep CPI can be useful in directing code-tuning efforts.
Non-halted CPI can correlate to the resource utilization of an application thread, if the application thread is affinitized to a fixed logical processor. In current implementations of processors based on Intel NetBurst microarchitecture, the theoretical lower bound for either non-halted CPI or non-sleep CPI is 1/3. Practical applications rarely achieve any value close to the lower bound.
Using a function decomposition threading model, a multithreaded application can pair up a thread with critical dependence on a low-throughput resource with other threads that do not have the same dependency.
User/Source Coding Rule 40. (M impact, L generality) If a single thread consumes half of the peak bandwidth of a specific execution unit (e.g.
Write-combining buffers are another example of execution resources shared between two logical processors. With two threads running simultaneously on a processor supporting Hyper-Threading Technology, the writes of both threads count toward the limit of four write-combining buffers. For example: if an inner loop that writes to three separate areas of memory per iteration is run by two threads simultaneously, the total number of cache lines written could be six.
64-bit Mode Coding Guidelines
Introduction
This chapter describes coding guidelines for application software written to run in 64-bit mode. These guidelines should be considered as an addendum to the coding guidelines described in Chapters 2 through 7. Software that runs in either compatibility mode or legacy non-64-bit modes should follow the guidelines described in Chapters 2 through 7.
This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, EDI. To access the data in registers r9-r15, the REX prefix is required, so using the 32-bit form there does not reduce code size.
Assembly/Compiler Coding Rule: Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional...
If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compiler should generate the multiply instruction that produces a 64-bit result. If the compiler or assembly programmer can not determine that the result will be less than 64 bits, then a multiply that produces a 128-bit result is necessary.
Can be replaced with:
movsx r8, r9w
movsx r8, r10b
In the above example, the moves to r8w and r8b both require a merge to preserve the rest of the bits in the register. There is an implicit real dependency on r8 between the 'mov r8w, r9w' and 'mov r8b, r10b'. Using movsx breaks the real dependency and leaves only the output dependency, which the processor can eliminate through renaming.
IMUL RAX, RCX
The 64-bit version above is more efficient than using the following 32-bit version:
MOV EAX, DWORD PTR[X]
MOV ECX, DWORD PTR[Y]
IMUL ECX
In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single 64-bit register.
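In C terms, the two multiply situations might be expressed as in the following sketch; the function names and the use of a compiler-specific 128-bit type are illustrative assumptions.

#include <stdint.h>

/* Both operands are known 32-bit values, so the product cannot exceed
   64 bits; the compiler can use the multiply form that produces a 64-bit
   result in a single destination register. */
uint64_t scale32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Here the operands are full 64-bit values, so a product wider than 64 bits
   is possible and the 128-bit-result form is needed to keep all the bits
   (shown with a compiler-specific 128-bit type, where available). */
#ifdef __SIZEOF_INT128__
unsigned __int128 scale64(uint64_t a, uint64_t b)
{
    return (unsigned __int128)a * b;
}
#endif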
Use the 32-bit versions of CVTSI2SS and CVTSI2SD when possible.
Using Software Prefetch
Intel recommends that software developers follow the recommendations in Chapter 2 and Chapter 6 when considering the choice of organizing data access patterns to take advantage of the hardware prefetcher (versus using software prefetch).
P-states to facilitate management of active power consumption, and several C-state types to facilitate management of static power consumption. Power saving techniques applicable to mobile platforms, such as Intel Centrino mobile technology or Intel Centrino Duo mobile technology, are a rich subject; only processor-related techniques are covered in this manual.
Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction of active power and static power consumption. These include:
• Enhanced Intel SpeedStep® Technology, which enables an operating system (OS) to program a processor to transition to lower frequency and/or voltage levels while executing a workload.
to accommodate demand and adapt power consumption. The interaction between the OS power management policy and performance history is described below:
1. Demand is high and the processor works at its highest possible frequency (P0).
2. Demand decreases, which the OS recognizes after some delay; the OS sets the processor to a lower frequency (P1).
ACPI C-States
When computational demands are less than 100%, part of the time the processor is doing useful work and the rest of the time it is idle. For example, the processor could be waiting on an application time-out set by a Sleep() function, waiting for a web server response, or waiting for a user mouse click.
The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower power consumption. They also require more time to wake up (higher exit latency). C-state types are described below: • C0: The processor is active and performing computations and executing instructions.
L2 cache to maintain its state. The Pentium M processor can be detected by CPUID signature with family 6, model 9 or 13; Intel Core Solo and Intel Core Duo processors have a CPUID signature with family 6, model 14.
• In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter into a Deep C4 state to save additional static power. The processor reduces voltage to the minimum level required to safely maintain processor context.
Adjust Performance to Meet Quality of Features When a system is battery powered, applications can extend battery life by reducing the performance or quality of features, turning off background activities, or both. Implementing such options in an application increases the processor idle time. Processor power consumption when idle is significantly lower than when active, resulting in longer battery life.
PeekMessage(). Use WaitMessage() to suspend the thread until a message is in the queue. The Intel® Mobile Platform Software Development Kit provides a set of APIs for mobile software to manage and optimize power consumption of the mobile processor and other components in the platform.
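A minimal sketch of the difference follows, assuming a standard Win32 message loop; the surrounding window procedure and message handling are omitted and are not taken from this manual.

#include <windows.h>

/* Busy polling with PeekMessage() keeps the processor in C0 even when no
   messages arrive.  Blocking in WaitMessage() lets the processor drop into
   a low-power C-state until input is actually queued. */
void message_loop(void)
{
    MSG msg;
    for (;;) {
        if (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
            if (msg.message == WM_QUIT)
                break;
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        } else {
            WaitMessage();   /* suspend the thread until a message is queued */
        }
    }
}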
(usually that equates to reducing the number of instructions that the processor needs to execute, or optimizing application performance). Optimizing an application starts with having efficient algorithms and then improving them using Intel software development tools, such as the Intel® VTune™ Performance Analyzers and the Intel Performance Libraries.
disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk access if it is not spinning.
Handling Sleep State Transitions
In some cases, transitioning to a sleep state may harm an application. For example, suppose an application is in the middle of using a file on the network when the system enters suspend mode.
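One way to react to such a transition in a Win32 application is sketched below; the state-saving helpers are hypothetical names introduced only for illustration.

#include <windows.h>

extern void save_and_close_network_files(void);  /* hypothetical helper */
extern void reopen_network_files(void);          /* hypothetical helper */

/* Respond to the system's suspend/resume broadcasts so files on the network
   can be flushed and closed before the machine sleeps and reopened after it
   resumes, instead of being left in an inconsistent state. */
LRESULT CALLBACK wnd_proc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    if (msg == WM_POWERBROADCAST) {
        switch (wParam) {
        case PBT_APMSUSPEND:
            save_and_close_network_files();
            break;
        case PBT_APMRESUMESUSPEND:
            reopen_network_files();
            break;
        }
        return TRUE;
    }
    return DefWindowProc(hwnd, msg, wParam, lParam);
}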
Using Enhanced Intel SpeedStep® Technology
Use Enhanced Intel SpeedStep Technology to adjust the processor to operate at a lower frequency and save energy. The basic idea is to divide computations into smaller pieces and use OS power management policy to effect a transition to higher P-states.
The same application can be written in such a way that work units are divided into smaller granularity, with the scheduling of each work unit and Sleep() occurring at more frequent intervals (e.g. 100 ms), to deliver the same QOS (operating at full performance 50% of the time).
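The scheme described in this paragraph might be sketched as follows; the work routine, the chunk duration, and the 100 ms period are illustrative assumptions.

#include <windows.h>

extern void do_work_unit(void);   /* hypothetical unit of work, < 100 ms */

/* Perform one small unit of work, then yield the remainder of the 100 ms
   slot, so the processor alternates between short bursts of activity and
   idle time instead of one long busy stretch. */
void paced_worker(void)
{
    DWORD start, used;
    for (;;) {
        start = GetTickCount();
        do_work_unit();
        used = GetTickCount() - start;
        if (used < 100)
            Sleep(100 - used);    /* idle for the rest of the 100 ms slot */
    }
}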
Instead, use longer idle periods to allow the processor to enter a deeper low power mode.
Enabling Intel® Enhanced Deeper Sleep
In typical mobile computing usages, the processor is idle most of the time.
C-state type. The lower-numbered state type is usually C2, but may even be C0. The situation is significantly improved in the Intel Core Solo processor (compared to previous generations of the Pentium M processors), but polling will likely prevent the processor from entering into highest-numbered, processor-specific C-state.
thread enables the physical processor to operate at lower frequency relative to a single-threaded version. This in turn enables the processor to operate at a lower voltage, saving battery life. Note that the OS views each logical processor or core in a physical processor as a separate entity and computes CPU utilization independently for each logical processor or core.
demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by such multi-core-unaware P-state coordination, resulting in a performance anomaly. See Figure 9-5.
Figure 9-5 Thread Migration in a Multi-Core Processor
2. Enabling both cores to take advantage of Intel Enhanced Deeper Sleep: To best utilize processor-specific C-states (e.g., Intel Enhanced Deeper Sleep) to conserve battery life in multithreaded applications, a multi-threaded application should synchronize threads to work simultaneously and sleep simultaneously using OS synchronization primitives.
Intel Core Duo processor provides an event for this purpose. The event (Serial_Execution_Cycle) increments under the following conditions: — The core is actively executing code in C0 state, — The second core in the physical processor is in an idle state (C1-C4).
Application Performance Tools
Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture (IA)-based processors. This appendix introduces these tools and explains their capabilities for developing the most efficient programs without having to write assembly code.
Microsoft .NET IDE. In Linux environment, the Intel C++ Compilers are binary compatible with the corresponding version of gcc. The Intel C++ compiler may be used from the Borland* IDE, or standalone, like the Fortran compiler. All compilers allow you to optimize your code by using special optimization options described in this section.
Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP parallelism are all supported by the Intel compilers and can significantly aid the performance of an application. The most general optimization options are -O1, -O2, and -O3; each of them enables a number of specific optimization options. In most cases,...
Code produced will run on any Intel architecture 32-bit processor, but will be optimized specifically for the targeted processor.
Automatic Processor Dispatch Support (-Qx[extensions] and -Qax[extensions])
The -Qx[extensions] and -Qax[extensions] options provide support to generate code that is specific to processor-instruction extensions. The corresponding options on Linux are -x[extensions] and -ax[extensions].
Vectorizer Switch Options
The Intel C++ and Fortran Compilers can vectorize your code using the vectorizer switch options. The options that enable the vectorizer are -Qx[M,K,W,B,P] and -Qax[M,K,W,B,P]. The compiler provides a number of other vectorizer switch options that allow you to control vectorization.
Multithreading with OpenMP*
Both the Intel C++ and Fortran Compilers support shared memory parallelism via OpenMP compiler directives, library functions and environment variables. OpenMP directives are activated by the compiler switch -Qopenmp. For details, see the User's Guides available with the Intel C++ and Fortran Compilers.
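For illustration, a loop parallelized with an OpenMP directive might look like the following minimal sketch; the loop itself is an assumption, not an example from the compiler documentation.

#include <omp.h>

/* The -Qopenmp compiler switch activates the directive below; without it
   the pragma is ignored and the loop runs serially. */
void vector_add(float *c, const float *a, const float *b, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}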
Profile-guided optimization is particularly beneficial for the Pentium 4 and Intel Xeon processor family. It greatly enhances the optimization decisions the compiler makes regarding instruction cache utilization and memory paging. Also, because PGO uses execution-time information to...
Repeat the instrumentation compilation if you make many changes to your source files after execution and before feedback compilation. For further details on the interprocedural and profile-guided optimizations, refer to the Intel C++ Compiler User’s Guide.
Intel® VTune™ Performance Analyzer
The Intel VTune Performance Analyzer is a powerful software-profiling tool for Microsoft Windows and Linux.
Sampling Sampling allows you to profile all active software on your system, including operating system, device driver, and application software. It works by occasionally interrupting the processor and collecting the instruction address, process ID, and thread ID. After the sampling activity completes, the VTune analyzer displays the data by process, thread, software module, function, or line of source.
The VTune analyzer indicates where microarchitectural events, specific to the Pentium 4, Pentium M and Intel Xeon processors, occur the most often. On Pentium M processors, the VTune analyzer can collect two different events at a time. The number of the events that the VTune analyzer can collect at once on the Pentium 4 and Intel Xeon processor depends on the events selected. Event-based samples are collected after a specific number of processor events have occurred.
Hardware prefetch mechanisms can be controlled on demand using the model-specific register IA32_MISC_ENABLES. Appendix B of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B, describes the specific bit locations of the IA32_MISC_ENABLES MSR.
stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a workload is to compare the ratio between bus read transactions and the number of DTLB pagewalks due to read traffic, under the condition of disabling the hardware prefetch while measuring bus traffic of the workload.
® The Intel Tuning Assistant can generate tuning advice based on counter monitor and sampling data. You can invoke the Intel Tuning Assistant from the source, counter monitor, or sampling views by clicking on the Intel Tuning Assistant icon. ®...
LAPACK and BLAS, Discrete Fourier Transforms (DFT), vector transcendental functions (vector math library/VML) and vector statistical functions (VSL). Intel MKL is optimized for the latest features and capabilities of the Intel Pentium 4 processor, Pentium M processor, Intel Xeon processors ®...
MKL and IPP functions are safe for use in a threaded environment.
Optimizations with the Intel® Performance Libraries
The Intel Performance Libraries implement a number of optimizations that are discussed throughout this manual. Examples include architecture-specific tuning such as loop unrolling, instruction pairing and scheduling;...
Intel Performance Libraries benefit from new architectural features of future generations of Intel processors simply by relinking the application with upgraded versions of the libraries. Enhanced Debugger (EDB) The Enhanced Debugger (EDB) enables you to debug C++, Fortran or mixed language programs running under Windows NT* or Windows 2000 (not Windows 98).
The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes your program and automatically locates threading errors. As your program runs, the Intel Thread Checker monitors memory accesses and other events and automatically detects situations which could cause unpredictable threading-related results.
Thread Profiler
The thread profiler is a plug-in data collector for the Intel VTune Performance Analyzer. Use it to analyze threading performance and identify parallel performance problems. The thread profiler graphically illustrates what each thread is doing at various levels of detail using a hierarchical summary.
Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded Execution Timelines
Intel® Software College
The Intel Software College is a valuable resource for classes on Streaming SIMD Extensions 2 (SSE2), Threading and the IA-32 Intel Architecture. For online training on how to use the SSE2 and...
The descriptions of the Intel Pentium 4 processor performance metrics use terminology that are specific to the Intel NetBurst microarchitecture and to the implementation in the Pentium 4 and Intel Xeon processors. The following sections explain the terminology specific to Pentium 4...
Branch mispredictions incur a large penalty on microprocessors with deep pipelines. In general, the direction of branches can be predicted with a high degree of accuracy by the front end of the Intel Pentium 4 processor, such that most computations can be performed along the predicted path while waiting for the resolution of the branch.
Replay In order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes aggressively schedules execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, μ...
miss more than once during its life time, but a Misses Retired metric (for example, 1st-Level Cache Load Misses Retired) will increment only once for that μop.
Counting Clocks
The count of cycles, also known as clock ticks, forms a fundamental basis for measuring how long a program takes to execute, and as part of efficiency ratios like cycles per instruction (CPI).
The first two metrics use performance counters, and thus can be used to cause interrupt upon overflow for sampling. They may also be useful for those cases where it is easier for a tool to read a performance counter instead of the time stamp counter. The timestamp counter is accessed via an instruction, RDTSC.
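As an illustration, the time stamp counter can also be read from C through a compiler intrinsic; the headers and the helper function below are assumptions about the build environment, not part of this manual.

#ifdef _MSC_VER
#include <intrin.h>        /* __rdtsc() on Microsoft and Intel compilers */
#else
#include <x86intrin.h>     /* __rdtsc() on gcc-compatible compilers      */
#endif

/* Return the number of time stamp counter ticks spent in a measured region.
   Note that this counts all ticks, including halted and sleep time, unlike
   the non-halted and non-sleep clocktick metrics described above. */
unsigned long long measured_region_ticks(void (*region)(void))
{
    unsigned long long start = __rdtsc();
    region();
    return __rdtsc() - start;
}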
Non-Sleep Clockticks
The performance monitoring counters can also be configured to count clocks whenever the performance monitoring hardware is not powered-down. To count “non-sleep clockticks” with a performance-monitoring counter, do the following:
• Select any one of the 18 counters.
•
that logical processor is not halted (it may include some portion of the clock cycles for that logical processor to complete a transition into a halted state). A physical processor that supports Hyper-Threading Technology enters into a power-saving state if all logical processors are halted.
Microarchitecture Notes
Trace Cache Events
The trace cache is not directly comparable to an instruction cache. The two are organized very differently. For example, a trace can span many lines' worth of instruction-cache data. As with most microarchitectural elements, trace cache performance is only an issue if something else is not a bigger bottleneck.
There is a simplified block diagram below of the sub-systems connected to the IOQ unit in the front side bus sub-system and the BSQ unit that interfaces to the IOQ. A two-way SMP configuration is illustrated. 1st-level cache misses and writebacks (also called core references) result in references to the 2nd-level cache.
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus (showing system memory, the chip set, the FSB IOQs, and the 1st-level data, unified 2nd-level and 3rd-level caches of two physical processors)
The granularities of core references are listed below, according to the performance monitoring events that are documented in Appendix A of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B.
Reads due to program loads
•
• IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses
Writebacks (dirty evictions)
• BSQ_cache_reference: 64 bytes
• BSQ_allocation: 64 bytes
• BSQ_active_entries: 64 bytes
• IOQ_allocation, IOQ_active_entries: 64 bytes
The count of IOQ allocations may exceed the count of corresponding BSQ allocations on current implementations for several reasons, including:
•
2nd-level cache, and the 3rd-level cache if present. But due to the current implementation of BSQ_cache_reference in Pentium 4 and Intel Xeon processors, they should not be used to calculate cache hit rates or cache miss rates. The following three paragraphs describe some of the issues related to BSQ_cache_reference, so that its results can be better interpreted.
64-byte granularity. Prefetches themselves are not counted as either hits or misses, as of Pentium 4 and Intel Xeon processors with a CPUID signature of 0xf21. However, Pentium 4 processor implementations with a CPUID signature of 0xf07 and earlier have the problem that reads to lines that are already being prefetched are counted as hits in addition to misses, thus overcounting hits.
That memory performance change may or may not be reflected in the measured FSB latencies. Also note that for Pentium 4 and Intel Xeon Processor implementations with an integrated 3rd-level cache, BSQ entries are allocated for all 2nd-level writebacks (replaced lines), not just those that become bus...
BSQ entries due to such references will become bus transactions.
Metrics Descriptions and Categories
The performance metrics for Intel Pentium 4 and Intel Xeon processors are listed in Table B-1. These performance metrics consist of recipes to program specific Pentium 4 and Intel Xeon processor performance monitoring events to obtain event counts that represent one of the following: number of instructions, cycles, or occurrences.
The additional sub-event information is included in column 3 as various tags, which are described in “Performance Metrics and Tagging Mechanisms”. For event names that appear in this column, refer to the IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A & 3B. •...
Table B-1 Pentium 4 Processor Performance Metrics (Metric: Description)
General Metrics
Non-Sleep Clockticks: The number of clockticks while a processor is not in any sleep mode.
Non-Halted Clockticks: The number of clockticks that the processor is not halted nor in sleep.
Instructions Retired: Non-bogus IA-32...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Speculative Number of uops Uops Retired retired (include both instructions executed to completion and speculatively executed in the path of branch mispredictions). Branching Metrics Branches All branch Retired instructions executed to completion Tagged The events counts...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Mispredicted The number of returns mispredicted returns including all causes. All conditionals The number of branches that are conditional jumps (may overcount if the branch is from build mode or there is a machine clear near the branch) Mispredicted...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description TC Flushes Number of TC flushes (The counter will count twice for each occurrence. Divide the count by 2 to get the number of flushes.) Logical The number of Processor 0 cycles that the trace Deliver Mode and delivery engine...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Logical The number of Processor 1 cycles that the trace Deliver Mode and delivery engine (TDE) is delivering traces associated with logical processor 1, regardless of the operating modes of the TDE for traces associated with logical processor 0.
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Logical The number of Processor 0 cycles that the trace Build Mode and delivery engine (TDE) is building traces associated with logical processor 0, regardless of the operating modes of the TDE for traces associated with logical processor 1.
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Trace Cache The number of times Misses that significant delays occurred in order to decode instructions and build a trace because of a TC miss. TC to ROM Twice the number of Transfers times that the ROM microcode is...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Memory Metrics Page Walk The number of page DTLB All walk requests due to Misses DTLB misses from either load or store. -Level Cache The number of retired μops that Load Misses Retired experienced...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description 64K Aliasing The number of 64K Conflicts aliasing conflicts. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The performance penalty resulted from 64K-aliasing conflict can vary from being unnoticeable to...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description MOB Load The number of Replays replayed loads related to the Memory Order Buffer (MOB). This metric counts only the case where the store-forwarding data is not an aligned subset of the stored data.
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description 2nd-Level The number of Cache Reads 2nd-level cache read Hit Shared references (loads and RFOs) that hit the cache line in shared state. Beware of granularity differences. 2nd-Level The number of Cache Reads 2nd-level cache read Hit Modified...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description 3rd-Level The number of Cache Reads 3rd-level cache read Hit Modified references (loads and RFOs) that hit the cache line in modified state. Beware of granularity differences. 3rd-Level The number of Cache Reads 3rd-level cache read Hit Exclusive...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description All WCB The number of times Evictions a WC buffer eviction occurred due to any causes (This can be used to distinguish 64K aliasing cases that contribute more significantly to performance penalty, e.g., stores that are 64K aliased.
Beware of granularity issues with this event. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2. Using Performance Monitoring Events...
Beware of granularity issues with this event. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2. B-32 Event Name or Metric Expression (Bus Accesses –...
RFOs). Beware of granularity issues with this event. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2. Reads The number of all...
Beware of granularity issues with this event. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2. All UC from the...
“Bus Accesses from the processor” to get bus request latency. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2. Using Performance Monitoring Events...
Non-prefetch read request latency. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2. B-36 Event Name or Metric...
Divide by “All UC from the processor” to get UC request latency. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2.
“Writes from the Processor” to get bus write request latency. Also Beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value less than 2. Bus Accesses...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Write WC Full The number of write (BSQ) (but neither writeback nor RFO) transactions to WC-type memory. Write WC The number of Partial (BSQ) partial write transactions to WC-type memory. User note: This event may undercount WC partials that originate...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Reads The number of read Non-prefetch (excludes RFOs and Full (BSQ) HW|SW prefetches) transactions to WB-type memory. Beware of granularity issues with this event. Reads The number of read Invalidate Full- invalidate (RFO) RFO (BSQ) transactions to...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description UC Write The number of UC Partial (BSQ) write transactions. Beware of granularity issues between BSQ and FSB IOQ events. IO Reads The number of Chunk (BSQ) 8-byte aligned IO port read transactions.
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description WB Writes Full This is an accrued Underway sum of the durations (BSQ) of writeback (evicted from cache) transactions to WB-type memory. Divide by Writes WB Full (BSQ) to estimate average request latency.
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Write WC Partial This is an accrued Underway sum of the durations (BSQ) of partial write transactions to WC-type memory. Divide by Write WC Partial (BSQ) to estimate average request latency. User note: Allocated entries of WC partials that originate...
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description SSE Input The number of Assists occurrences of SSE/SSE2 floating-point operations needing assistance to handle an exception condition. The number of occurrences includes speculative counts. Packed SP Non-bogus packed Retired single-precision instructions retired.
Table B-1 Pentium 4 Processor Performance Metrics (continued) Metric Description Stalled Cycles The duration of stalls of Store Buffer due to lack of store Resources buffers. (non-standard Stalls of Store The number of Buffer allocation stalls due Resources to lack of store (non-standard buffers.
Compare Edge split_load_retired to count at retirement. This section replay_event front_end_event . Please refer to Appendix A of the IA-32 Intel® μ ops at retirement using the μ ops so they can be detected at retirement. Some μ ops. The event names referenced in μ...
Table B-2 Metrics That Utilize Replay Tagging Mechanism (Replay Metric Tag: bit fields to set in IA32_PEBS_ENABLE)
1stL_cache_load_miss_retired: Bit 0, BIT 24, BIT 25
2ndL_cache_load_miss_retired: Bit 1, BIT 24, BIT 25
DTLB_load_miss_retired: Bit 2, BIT 24, BIT 25
DTLB_store_miss_retired: Bit 2, BIT 24,...
Tags for front_end_event Table B-3 provides a list of the tags that are used by various metrics derived from the column 2 can be found from the Pentium 4 processor performance monitoring events. Table B-3 Table 3 Metrics That Utilize the Front-end Tagging Mechanism Front-end MetricTags Memory_loads Memory_stores...
Table B-4 Metrics That Utilize the Execution Tagging Mechanism Execution Metric Tags Packed_SP_retired Scalar_SP_retired Scalar_DP_retired 128_bit_MMX_retired 64_bit_MMX_retired X87_FP_retired Using Performance Monitoring Events Tag Value in Upstream Upstream ESCR ESCR Set the ALL bit in the event mask and the TagUop bit in the ESCR of packed_SP_uop.
Using Performance Metrics with Hyper-Threading Technology
On Intel Xeon processors that support Hyper-Threading Technology, the performance metrics listed in Table B-1 may be qualified to associate the counts with a specific logical processor, provided the relevant performance monitoring events support qualification by logical processor.
The performance metrics listed in Table B-1 fall into three categories:
• Logical processor specific and supporting parallel counting.
• Logical processor specific but constrained by ESCR limitations.
• Logical processor independent and not supporting parallel counting.
Table B-5 lists performance metrics in the first and second category. Table B-6 lists performance metrics in the third category.
Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued) Branching Metrics TC and Front End Metrics B-52 Branches Retired Tagged Mispredicted Branches Retired Mispredicted Branches Retired All returns All indirect branches All calls All conditionals Mispredicted returns Mispredicted indirect branches Mispredicted calls Mispredicted conditionals...
Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued) Memory Metrics Using Performance Monitoring Events Split Load Replays Split Store Replays MOB Load Replays 64k Aliasing Conflicts 1st-Level Cache Load Misses Retired 2nd-Level Cache Load Misses Retired DTLB Load Misses Retired Split Loads Retired Split Stores Retired...
Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued) Bus Metrics B-54 Bus Accesses from the Processor Non-prefetch Bus Accesses from the Processor Reads from the Processor Writes from the Processor Reads Non-prefetch from the Processor All WC from the Processor All UC from the Processor Bus Accesses from All Agents...
Table B-6 Metrics That Support Qualification by Logical Processor and Parallel Counting (continued) Characterization Metrics Parallel counting is not supported due to ESCR restrictions. Table B-7 Metrics That Are Independent of Logical Processors General Metrics TC and Front End Metrics Memory Metrics Bus Metrics Characterization Metrics...
Intel Core Duo Processors
There are performance events specific to the microarchitecture of Intel Core Solo and Intel Core Duo processors (see Table A-9 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B).
Understanding the Results in a Performance Counter
Each performance event detects a well-defined microarchitectural condition occurring in the core while the core is active.
There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles. All three events are detected for the unit selected by event 3CH. Some events detect microarchitectural conditions but are limited in their ability to identify the originating core or physical processor.
Notes on Selected Events
This section provides event-specific notes for interpreting performance events listed in Table A-9 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B.
• L2_Reject_Cycles, event number 30H: This event counts the cycles during which the L2 cache rejected new access requests.
• Serial_Execution_Cycles, event number 3C, unit mask 02H: This event counts the bus cycles during which the core is actively executing code (non-halted) while the other core in the physical processor is halted.
• L1_Pref_Req, event number 4FH, unit mask 00H: This event counts the number of times the Data Cache Unit (DCU) requests to prefetch a data cache line from the L2 cache.
IA-32 instructions The instruction timing data varies within the IA-32 family of processors. Only data specific to the Intel Pentium 4, Intel Xeon processors and Intel Pentium M processor are provided. The relevance of instruction throughput and latency information for code tuning is discussed in Chapter 1 and Chapter 2, see “Execution Core Detail”...
Overview
The current generation of the IA-32 family of processors uses out-of-order execution with dynamic scheduling and buffering to tolerate poor instruction selection and scheduling that may occur in legacy code. It can reorder μops to cover latency delays and to avoid resource conflicts. In some cases, the microarchitecture's ability to avoid such delays can be enhanced by arranging IA-32 instructions.
ROM. These instructions with longer μop flows incur a delay in the front end and reduce the supply of uops to the execution core. In Pentium 4 and Intel Xeon processors, transfers to microcode ROM often reduce how efficiently μops can be packed into the trace cache.
FP_ADD FP_MUL cluster (see Figure 1-4, Figure 1-4 applies FP_EXECUTE to Pentium 4 and Intel Xeon processors with CPUID signature of family 15, model encoding = 0, 1, 2). , or in the MMX_SHFT...
All numeric data in the tables are: — approximate and are subject to change in future implementations of the Intel NetBurst microarchitecture or the Pentium M processor microarchitecture. — not meant to be used as reference numbers for comparisons of instruction-level performance benchmarks.
Latency and Throughput with Register Operands
IA-32 instruction latency and throughput data are presented in Table C-2 through Table C-8. The tables include the Streaming SIMD Extension 3, Streaming SIMD Extension 2, Streaming SIMD Extension, MMX technology and most of the commonly used IA-32 instructions. Instruction latency and throughput of the Pentium 4 processor and of the Pentium M processor are given in separate columns.
Table C-5 Streaming SIMD Extension 64-bit Integer Instructions Instruction CPUID PAVGB/PAVGW mm, mm PEXTRW r32, mm, imm8 PINSRW mm, r32, imm8 PMAX mm, mm PMIN mm, mm PMOVMSKB r32, mm PMULHUW mm, mm PSADBW mm, mm PSHUFW mm, mm, imm8 See “Table Footnotes”...
Table C-6 MMX Technology 64-bit Instructions (continued) Instruction PCMPGTB/PCMPGTD/ PCMPGTW mm, mm PMADDWD mm, mm PMULHW/PMULLW mm, mm POR mm, mm PSLLQ/PSLLW/ PSLLD mm, mm/imm8 PSRAW/PSRAD mm, mm/imm8 PSRLQ/PSRLW/PSRLD mm, mm/imm8 PSUBB/PSUBW/PSUBD mm, mm PSUBSB/PSUBSW/PSU BUSB/PSUBUSW mm, PUNPCKHBW/PUNPCK HWD/PUNPCKHDQ mm, mm PUNPCKLBW/PUNPCK LWD/PUNPCKLDQ mm, PXOR mm, mm...
The names of execution units apply to processor implementations of the Intel NetBurst microarchitecture only with CPUID signature of family 15, model encoding = 0, 1, 2. They include: FP_EXECUTE FPMOVE execution units and ports in the out-of-order core.
Pentium 4 and Intel Xeon processors.
Latency and Throughput with Memory Operands
The discussion in this section applies to the Intel Pentium 4 and Intel Xeon processors. Typically, instructions with a memory address as the source operand add one more μop to the "reg, reg" instruction types listed in Table C-1 through C-7.
For the sake of simplicity, all data being requested is assumed to reside in the first level data cache (cache hit). In general, IA-32 instructions with load operations that execute in the integer ALU units require two more clock cycles than the corresponding register-to-register flavor of the same instruction.
__m128 register spill locations aligned throughout a function invocation. The Intel C++ Compiler for Win32* Systems supports the conventions presented here, which help to prevent memory references from incurring penalties due to misaligned data by keeping them aligned to 16-byte boundaries. In addition, this scheme supports improved...
A Microsoft-compiled function, for example, can only assume that the frame pointer it used is 4-byte aligned. Earlier versions of the Intel C++ Compiler for Win32 Systems attempted to provide 8-byte aligned stack frames by dynamically adjusting the stack frame pointer in the prologue of main and preserving 8-byte alignment of the functions it compiles.
Figure D-1 Stack Frames Based on Alignment Type (ESP-based aligned frame: parameters, return address, padding, register save area, local variables and spill slots, __cdecl parameter passing space, __stdcall parameter passing space)
As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the caller.
Example D-1 in the following sections illustrates this technique. Note the entry points foo and foo.aligned; the latter is the alternate aligned entry point.
Aligned esp-Based Stack Frames
This section discusses data and parameter alignment and the __declspec(align()) extended attribute, which can be used to request alignment in C and C++ code.
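For example, alignment of a data object can be requested as in the short sketch below; the structure contents and names are illustrative assumptions, and __declspec(align) is the spelling used by the Intel and Microsoft compilers for Win32 systems.

#include <xmmintrin.h>

/* Request 16-byte alignment so the __m128 member can be accessed with
   aligned (movaps-class) loads and stores. */
typedef __declspec(align(16)) struct {
    __m128 v;
    float  scale;
} aligned_vec;

static aligned_vec coeffs;   /* the object starts on a 16-byte boundary */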
NOTE. block beginnings are aligned. This places the stack pointer at a 12 mod 16 boundary, as the return pointer has been pushed. Thus, the unaligned entry point must force the stack pointer to this boundary. stack is at an 8 mod 16 boundary, and adds sufficient space to the stack so that the stack pointer is aligned to a 0 mod 16 boundary.
16, and thus the caller must account for the remaining adjustment.
Stack Frame Optimizations
The Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up and used. These optimizations are as follows:
•
(since function’s epilog). For additional information on the use of other related issues, see relevant application notes in the Intel Architecture Performance Training Center. D-10 register generally should not be CAUTION.
Mathematics of Prefetch Scheduling Distance
This appendix discusses how far away to insert prefetch instructions. It presents a mathematical model allowing you to deduce a simplified equation which you can use for determining the prefetch scheduling distance (PSD) for your application. For your convenience, the first section presents this simplified equation; the second section provides the background for this equation: the mathematical model of the calculation.
Consider the following example of a heuristic equation, assuming that the parameters have the values as indicated:

    psd = (Nlookup + Nxfer * (Npref + Nst)) / (CPI * Ninst)

where 60 corresponds to Nlookup, the number of clocks for a memory lookup. The values of the parameters in the equation can be derived from the documentation for memory components and chipsets as well as from vendor datasheets.
Note that the potential effects of µop reordering are not factored into the estimations discussed. Examine Example E-1, which uses a prefetch scheduling distance of 3, that is, psd = 3. The data prefetched in iteration i will actually be used in iteration i+3. Tc represents the cycles needed to execute the loop with all memory references hitting the cache, while il (iteration latency) represents the cycles needed to execute this loop with the actual run-time memory footprint.
Memory access plays a pivotal role in prefetch scheduling. For a better understanding of a memory subsystem, consider the Streaming SIMD Extensions and Streaming SIMD Extensions 2 memory pipeline depicted in Figure E-1.
Figure E-1 Pentium, Pentium III and Pentium 4 Processors Memory Pipeline Sketch
Assume that three cache lines are accessed per iteration and four chunks of data are returned per iteration for each cache line.
varies dynamically and is also system hardware-dependent. The static variants include the core-to-front-side-bus ratio, memory manufacturer and memory controller (chipset). The dynamic variants include the memory page open/miss occasions, memory accesses sequence, different memory types, and so on. To determine the proper prefetch scheduling distance, follow these steps and formulae: •...
No Preloading or Prefetch

The traditional programming approach does not perform data preloading or prefetch. It is sequential in nature and will experience stalls because the memory is unable to provide the data immediately when the execution pipeline requires it. Examine Figure E-2.

Figure E-2 Execution Pipeline, No Preloading or Prefetch (execution units idle during memory accesses)
The iteration latency is approximately equal to the computation latency plus the memory leadoff latency (which includes cache miss latency, chipset latency, bus arbitration, and so on) plus the data transfer latency, where transfer latency = number of lines per iteration × line burst latency. This means that the decoupled memory and execution pipelines are ineffective at exploiting parallelism because of the flow dependency.
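As a hedged worked example (the numbers below are invented for illustration, not measured on any particular platform), suppose the computation latency is Tc = 30 clocks per iteration, the memory leadoff latency is Tl = 60 clocks, two cache lines are fetched per iteration, and the line burst latency is Tb = 25 clocks. Without preloading or prefetch, each iteration then costs roughly

    il ≈ Tc + Tl + 2 × Tb = 30 + 60 + 2 × 25 = 140 clocks,

so the execution units sit idle for about 110 of the 140 clocks while waiting on memory.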
The following formula shows the relationship among the parameters for this compute-bound case (Tc ≥ Tl + Tb):

    il = Tc

It can be seen from this relationship that the iteration latency is equal to the computation latency, which means the memory accesses are executed in the background and their latencies are completely hidden.

Compute Bound (Case: Tl + Tb > Tc > Tb)

Now consider the next case by first examining Figure E-4.
For this particular example the prefetch scheduling distance is greater than 1. Data being prefetched for iteration i will be consumed in iteration i+2. Figure E-4 represents the case when the leadoff latency plus data transfer latency is greater than the compute latency, which is greater than the data transfer latency.
Memory Throughput Bound (Case: Tb ≥ Tc)

When the application or loop is memory throughput bound, there is no way to hide the memory latency. Under such circumstances, the burst latency is always greater than the compute latency. Examine Figure E-5.

Figure E-5 Memory Throughput Bound Pipeline (front-side bus versus execution pipeline)

The following relationship calculates the prefetch scheduling distance...
Being memory throughput bound means the memory is the bottleneck, and you cannot do much about it. Typically, data copy from one space to another, for example, a graphics driver moving data from writeback memory to write-combining memory, belongs to this category, where the performance advantage from prefetch instructions will be marginal.
Now, for the case in which the computation latency Tc varies, examine the following graph. Consider the graph of accesses per iteration in example 1, Figure E-6.

Figure E-6 Accesses per Iteration, Example 1

The prefetch scheduling distance is a step function of Tc, the computation latency. The steady-state iteration latency (il) is either memory-bound or compute-bound, depending on Tc, if prefetches are scheduled effectively.
Figure E-7 Accesses per Iteration, Example 2 (psd for different numbers of cache lines prefetched per iteration)

In reality, the front-side bus (FSB) pipelining depth is limited; that is, only four transactions are allowed at a time in the Pentium III and Pentium 4 processors.