Manuals and User Guides for Intel ARCHITECTURE IA-32. We have 1 Intel ARCHITECTURE IA-32 manual available for free PDF download: Reference Manual
Intel ARCHITECTURE IA-32 Reference Manual (568 pages)
Architecture Optimization
Brand: Intel | Category: Processor | Size: 2.55 MB
Table of Contents
3
About this Manual
24
Chapter 1 Intel ® Architecture Processor Family Overview
25
Introduction
25
Related Documentation
27
SIMD Technology
30
Figure 1-1 Typical SIMD Operations
31
Figure 1-2 SIMD Instruction Register Usage
32
Summary of SIMD Technologies
33
MMX™ Technology
33
Streaming SIMD Extensions
33
Streaming SIMD Extensions 2
34
Streaming SIMD Extensions 3
34
Intel ® Extended Memory 64 Technology (Intel ® EM64T)
35
Intel NetBurst® Microarchitecture
36
Design Goals of Intel NetBurst Microarchitecture
36
Overview of the Intel NetBurst Microarchitecture Pipeline
37
Figure 1-3 The Intel NetBurst Microarchitecture
38
The Front End
39
Retirement
40
The Out-Of-Order Core
40
Front End Pipeline Detail
41
Prefetching
41
Decoder
42
Execution Trace Cache
42
Branch Prediction
43
Execution Core Detail
44
Instruction Latency and Throughput
45
Execution Units and Issue Ports
46
Caches
47
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core
47
Pentium 4 and Intel Xeon Processor Cache Parameters
48
Table
48
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters
48
Data Prefetch
49
Loads and Stores
52
Store Forwarding
53
Intel ® Pentium ® M Processor Microarchitecture
54
The Front End
55
Figure 1-5 The Intel Pentium M Processor Microarchitecture
55
Data Prefetching
57
Out-Of-Order Core
58
Table 1-3 Cache Parameters of Pentium M, Intel
58
Table 1-2 Trigger Threshold and CPUID Signatures for IA-32 Processor Families
58
In-Order Retirement
59
Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors
59
Front End
60
Data Prefetching
61
Hyper-Threading Technology
61
Figure 1-6 Hyper-Threading Technology on an SMP
63
Processor Resources and Hyper-Threading Technology
64
Partitioned Resources
64
Replicated Resources
64
Shared Resources
65
Front End Pipeline
66
Microarchitecture Pipeline and Hyper-Threading Technology
66
Execution Core
67
Retirement
67
Multi-Core Processors
67
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor
69
Microarchitecture Pipeline and Multi-Core Processors
70
Shared Cache in Intel Core Duo Processors
70
Load and Store Operations
70
Table 1-4 Family and Model Designations of Microarchitectures
70
Table
70
Characteristics of Load and Store Operations in Intel Core Duo Processors
71
Table
71
Chapter 2 General Optimization Guidelines
73
Tuning to Achieve Optimum Performance
73
Tuning to Prevent Known Coding Pitfalls
74
Coding Pitfalls Affecting Performance
74
Table
74
General Practices and Coding Guidelines
75
Table
75
Use Available Performance Tools
76
Optimize Performance Across Processor Generations
76
Optimize Branch Predictability
77
Optimize Memory Access
77
Optimize Floating-Point Performance
78
Optimize Instruction Selection
78
Optimize Instruction Scheduling
79
Enable Vectorization
79
Coding Rules, Suggestions and Tuning Hints
80
Performance Tools
81
Intel ® C++ Compiler
81
General Compiler Recommendations
82
VTune™ Performance Analyzer
82
Processor Perspectives
83
CPUID Dispatch Strategy and Compatible Code Strategy
85
Transparent Cache-Parameter Strategy
86
Threading Strategy and Hardware Multi-Threading Support
86
Branch Prediction
87
Eliminating Branches
87
Example 2-1 Assembly Code with an Unpredictable Branch
89
Example 2-2 Code Optimization to Eliminate Branches
89
Spin-Wait and Idle Loops
90
Example 2-3 Eliminating Branch with CMOV Instruction
90
Static Prediction
91
Example 2-4 Use of PAUSE Instruction
91
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm
92
Example 2-6 Static Taken Prediction Example
93
Example 2-7 Static Not-Taken Prediction Example
93
Inlining, Calls and Returns
94
Branch Type Selection
95
Example 2-8 Indirect Branch with Two Favored Targets
97
Loop Unrolling
98
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction
98
Compiler Support for Branch Prediction
100
Example 2-10 Loop Unrolling
100
Memory Accesses
101
Alignment
101
Example 2-11 Code that Causes Cache Line Split
103
Figure 2-1 Cache Line Split in Accessing Elements in an Array
103
Store Forwarding
104
Store-To-Load-Forwarding Restriction on Size and Alignment
105
Figure 2-2 Size and Alignment Restrictions in Store Forwarding
106
Example 2-12 Several Situations of Small Loads after Large Store
107
Example 2-13 A Non-Forwarding Example of Large Load after Small Store
108
Example 2-14 A Non-Forwarding Situation in Compiler Generated Code
108
Example 2-15 Two Examples to Avoid the Non-Forwarding Situation in Example 2-14
108
Example 2-16 Large and Small Load Stalls
109
Store-Forwarding Restriction on Data Availability
110
Data Layout Optimizations
111
Example 2-17 An Example of Loop-Carried Dependence Chain
111
Example 2-18 Rearranging a Data Structure
111
Example 2-19 Decomposing an Array
112
Stack Alignment
114
Capacity Limits and Aliasing in Caches
115
Example 2-20 Dynamic Stack Alignment
115
Capacity Limits in Set-Associative Caches
116
Aliasing Cases in the Pentium ® 4 and Intel ® Xeon ® Processors
117
Aliasing Cases in the Pentium M Processor
118
Mixing Code and Data
119
Self-Modifying Code
119
Write Combining
120
Locality Enhancement
122
Minimizing Bus Latency
124
Non-Temporal Store Bus Traffic
125
Example 2-21 Non-Temporal Stores and 64-Byte Bus Write Transactions
126
Example 2-22 Non-Temporal Stores and Partial Bus Write Transactions
126
Prefetching
127
Hardware Instruction Fetching
127
Software and Hardware Cache Line Fetching
127
Cacheability Instructions
128
Code Alignment
129
Improving the Performance of Floating-Point Applications
129
Guidelines for Optimizing Floating-Point Code
130
Floating-Point Modes and Exceptions
132
Floating-Point Exceptions
132
Floating-Point Modes
134
Example 2-23 Algorithm to Avoid Changing the Rounding Mode
138
Improving Parallelism and the Use of FXCH
140
X87 Vs. Scalar SIMD Floating-Point Trade-Offs
141
Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors
142
Memory Operands
143
Floating-Point Stalls
144
Transcendental Functions
144
X87 Floating-Point Comparison Instructions
144
X87 Floating-Point Operations with Integer Operands
144
Instruction Selection
145
Complex Instructions
146
Use of the LEA Instruction
146
Use of the INC and DEC Instructions
147
Use of the Shift and Rotate Instructions
147
Flag Register Accesses
147
Integer Divide
148
Operand Sizes and Partial Register Accesses
148
Table 2-2 Avoiding Partial Flag Register Stall
148
Example 2-24 Dependencies Caused by Referencing Partial Registers
149
Table 2-3 Avoiding Partial Register Stall When Packing Byte Values
150
Prefixes and Instruction Decoding
152
REP Prefix and Data Movement
153
Table 2-4 Avoiding False LCP Delays with 0xF7 Group Instructions
153
Table 2-5 Using REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination
157
Address Calculations
158
Clearing Registers
159
Compares
159
Floating-Point/SIMD Operands
160
Code Sequences that Operate on Memory Operands
162
Prolog Sequences
162
Instruction Scheduling
163
Latencies and Resource Constraints
163
Example 2-25 Recombining LOAD/OP Code into REG,MEM Form
163
Spill Scheduling
164
Scheduling Rules for the Pentium 4 Processor Decoder
164
Scheduling Rules for the Pentium M Processor Decoder
165
Vectorization
165
Miscellaneous
167
NOPs
167
Summary of Rules and Suggestions
168
User/Source Coding Rules
169
Assembly/Compiler Coding Rules
171
Tuning Suggestions
180
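The Chapter 2 entries above on eliminating branches (Examples 2-2 and 2-3) revolve around replacing an unpredictable branch with a conditional move so the pipeline never mispredicts. A minimal C sketch of that pattern, assuming nothing beyond standard C — the function names are illustrative, not from the manual:

```c
/* Branchless select: equivalent to (cond ? a : b).
   With optimization enabled, x86 compilers typically emit CMOV
   for this pattern instead of a conditional jump. */
static int select_branchless(int cond, int a, int b)
{
    int mask = -(cond != 0);        /* all-ones if cond, else zero */
    return (a & mask) | (b & ~mask);
}

/* Branchless clamp of x into [lo, hi], in the spirit of the
   clipping examples later in the manual. */
static int clamp_branchless(int x, int lo, int hi)
{
    x = select_branchless(x < lo, lo, x);
    x = select_branchless(x > hi, hi, x);
    return x;
}
```

In practice the plain ternary form is often enough for the compiler to choose CMOV; the explicit mask form simply guarantees no branch appears in the source.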
Chapter 3 Coding for SIMD Architectures
181
Checking for Processor Support of SIMD Technologies
182
Checking for MMX Technology Support
182
Checking for Streaming SIMD Extensions Support
183
Example 3-1 Identification of MMX Technology with CPUID
183
CPUID Instruction
183
Example 3-2 Identification of SSE with CPUID
184
Example 3-3 Identification of SSE by the OS
184
Checking for Streaming SIMD Extensions 2 Support
185
Example 3-4 Identification of SSE2 with CPUID
185
Checking for Streaming SIMD Extensions 3 Support
186
Example 3-5 Identification of SSE2 by the OS
186
Example 3-6 Identification of SSE3 with CPUID
187
Considerations for Code Conversion to SIMD Programming
188
Example 3-7 Identification of SSE3 by the OS
188
Figure 3-1 Converting to Streaming SIMD Extensions Chart
189
Identifying Hot Spots
190
Determine if Code Benefits by Conversion to SIMD Execution
191
Coding Techniques
192
Coding Methodologies
193
Figure 3-2 Hand-Coded Assembly and High-Level Compiler Performance Trade-Offs
193
Example 3-8 Simple Four-Iteration Loop
194
Assembly
195
Intrinsics
195
Example 3-10 Simple Four-Iteration Loop Coded with Intrinsics
196
Classes
197
Automatic Vectorization
198
Example 3-11 C++ Code Using the Vector Classes
198
Example 3-12 Automatic Vectorization for a Simple Loop
199
Stack and Data Alignment
200
Alignment and Contiguity of Data Access Patterns
200
Using Padding to Align Data
200
Using Arrays to Make Data Contiguous
201
Stack Alignment for 128-Bit SIMD Technologies
202
Data Alignment for MMX Technology
203
Example 3-13 C Algorithm for 64-Bit Data Alignment
203
Data Alignment for 128-Bit Data
204
Compiler-Supported Alignment
204
Improving Memory Utilization
207
Data Structure Layout
207
Example 3-16 AoS and SoA Code Samples
208
Strip Mining
212
Example 3-18 Pseudo-Code before Strip Mining
212
Example 3-19 Strip Mined Code
213
Loop Blocking
214
Example 3-20 Loop Blocking
215
Figure 3-3 Loop Blocking Access Pattern
216
Instruction Selection
217
Example 3-21 Emulation of Conditional Moves
217
SIMD Optimizations and Microarchitectures
218
Tuning the Final Application
219
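Chapter 3's Examples 3-1 through 3-7 check for SIMD support via the CPUID instruction before dispatching SIMD code. A hedged C sketch of the same check; it assumes a GCC/Clang toolchain, where `<cpuid.h>` and `__get_cpuid` are compiler-provided helpers rather than anything from the manual's own listings:

```c
/* Runtime SSE2 detection in the spirit of Examples 3-4/3-5. */
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>

static int cpu_supports_sse2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                  /* CPUID leaf 1 unavailable */
    return (edx >> 26) & 1;        /* EDX bit 26 = SSE2 */
}
#else
static int cpu_supports_sse2(void) { return 0; }  /* non-x86 fallback */
#endif
```

As the manual's OS-check examples note, CPUID reporting a feature is necessary but not sufficient: the operating system must also save/restore the SIMD register state for the feature to be usable.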
Chapter 4 Optimizing for SIMD Integer Applications
221
General Rules on SIMD Integer Code
222
Using SIMD Integer with X87 Floating-Point
223
Using the EMMS Instruction
223
Guidelines for Using EMMS Instruction
224
Example 4-1 Resetting the Register between __m64 and FP Data Types
225
Data Alignment
226
Data Movement Coding Techniques
226
Unsigned Unpack
226
Signed Unpack
227
Interleaved Pack with Saturation
228
Example 4-3 Signed Unpack Code
228
Figure 4-1 PACKSSDW mm, mm/m64 Instruction Example
229
Figure 4-2 Interleaved Pack with Saturation
229
Interleaved Pack Without Saturation
230
Example 4-4 Interleaved Pack with Saturation
230
Non-Interleaved Unpack
231
Example 4-5 Interleaved Pack Without Saturation
231
Figure 4-3 Result of Non-Interleaved Unpack Low in MM0
232
Figure 4-4 Result of Non-Interleaved Unpack High in MM1
232
Extract Word
233
Example 4-6 Unpacking Two Packed-Word Sources in a Non-Interleaved Way
233
Insert Word
234
Example 4-7 PEXTRW Instruction Code
234
Figure 4-5 PEXTRW Instruction
234
Example 4-8 PINSRW Instruction Code
235
Figure 4-6 PINSRW Instruction
235
Move Byte Mask to Integer
236
Example 4-9 Repeated PINSRW Instruction Code
236
Example 4-10 PMOVMSKB Instruction Code
237
Figure 4-7 PMOVMSKB Instruction Example
237
Packed Shuffle Word for 64-Bit Registers
238
Figure 4-8 PSHUF Instruction Example
238
Packed Shuffle Word for 128-Bit Registers
239
Example 4-11 PSHUF Instruction Code
239
Example 4-12 Broadcast Using 2 Instructions
239
Unpacking/Interleaving 64-Bit Data in 128-Bit Registers
240
Example 4-13 Swap Using 3 Instructions
240
Example 4-14 Reverse Using 3 Instructions
240
Data Movement
241
Conversion Instructions
241
Generating Constants
241
Example 4-15 Generating Constants
241
Building Blocks
243
Absolute Difference of Unsigned Numbers
243
Example 4-16 Absolute Difference of Two Unsigned Numbers
243
Absolute Difference of Signed Numbers
244
Example 4-17 Absolute Difference of Signed Numbers
244
Absolute Value
245
Example 4-18 Computing Absolute Value
245
Clipping to an Arbitrary Range [High, Low]
246
Highly Efficient Clipping
247
Example 4-19 Clipping to a Signed Range of Words [High, Low]
247
Example 4-20 Clipping to an Arbitrary Signed Range [High, Low]
247
Clipping to an Arbitrary Unsigned Range [High, Low]
248
Example 4-21 Simplified Clipping to an Arbitrary Signed Range
248
Packed Max/Min of Signed Word and Unsigned Byte
249
Signed Word
249
Example 4-22 Clipping to an Arbitrary Unsigned Range [High, Low]
249
Unsigned Byte
250
Packed Multiply High Unsigned
250
Packed Sum of Absolute Differences
250
Packed Average (Byte/Word)
251
Figure 4-9 PSADBW Instruction Example
251
Complex Multiply by a Constant
252
Example 4-23 Complex Multiply by a Constant
252
Packed 32*32 Multiply
253
Packed 64-Bit Add/Subtract
253
128-Bit Shifts
253
Memory Optimizations
254
Partial Memory Accesses
255
Example 4-24 A Large Load after a Series of Small Stores (Penalty)
255
Example 4-25 Accessing Data Without Delay
255
Example 4-26 A Series of Small Loads after a Large Store
256
Example 4-27 Eliminating Delay for a Series of Small Loads after a Large Store
256
Supplemental Techniques for Avoiding Cache Line Splits
257
Example 4-28 an Example of Video Processing with Cache Line Splits
257
Example 4-29 Video Processing Using LDDQU to Avoid Cache Line Splits
258
Increasing Bandwidth of Memory Fills and Video Fills
259
Increasing Memory Bandwidth by Loading and Storing to and from the same DRAM Page
259
Increasing Memory Bandwidth Using the MOVDQ Instruction
259
Increasing UC and WC Store Bandwidth by Using Aligned Stores
260
Converting from 64-Bit to 128-Bit SIMD Integer
260
SIMD Optimizations and Microarchitectures
261
Packed SSE2 Integer Versus MMX Instructions
262
Chapter 5 Optimizing for SIMD Floating-Point Applications
263
General Rules for SIMD Floating-Point Code
263
Planning Considerations
264
Using SIMD Floating-Point with X87 Floating-Point
265
Scalar Floating-Point Code
265
Data Alignment
266
Data Arrangement
266
Figure 5-1 Homogeneous Operation on Parallel Data Elements
267
Vertical Versus Horizontal Computation
267
Table 5-1 SoA Form of Representing Vertices Data
269
Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation
270
Figure 5-2 Dot Product Operation
270
Data Swizzling
271
Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation
271
Example 5-3 Swizzling Data
272
Example 5-4 Swizzling Data Using Intrinsics
274
Data Deswizzling
276
Example 5-5 Deswizzling Single-Precision SIMD Data
276
Example 5-6 Deswizzling Data Using the MOVLHPS and Shuffle Instructions
277
Example 5-7 Deswizzling 64-Bit Integer SIMD Data
278
Using MMX Technology Code for Copy or Shuffling Functions
279
Example 5-8 Using MMX Technology Code for Copying or Shuffling
280
Horizontal ADD Using SSE
280
Figure 5-3 Horizontal Add Using MOVHLPS/MOVLHPS
281
Use of CVTTPS2PI/CVTTSS2SI Instructions
283
Example 5-10 Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS
283
Flush-To-Zero and Denormals-Are-Zero Modes
284
SIMD Floating-Point Programming Using SSE3
284
SSE3 and Complex Arithmetic
285
Figure 5-4 Asymmetric Arithmetic Operation of the SSE3 Instruction
285
Figure 5-5 Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD
285
Example 5-11 Multiplication of Two Pairs of Single-Precision Complex Numbers
286
Example 5-12 Division of Two Pairs of Single-Precision Complex Numbers
287
SSE3 and Horizontal Computation
288
Example 5-13 Calculating Dot Products from AoS
288
SIMD Optimizations and Microarchitectures
289
Packed Floating-Point Performance
289
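The data-swizzling entries above (Examples 5-3 through 5-7) concern converting vertex data between array-of-structures (AoS) and structure-of-arrays (SoA) layouts so that each field becomes a contiguous, SIMD-friendly stream. A minimal C sketch of the layout transform; plain loops stand in for the manual's SSE shuffle sequences, and the type names are illustrative:

```c
/* AoS element: one vertex with interleaved fields. */
typedef struct { float x, y, z; } Vertex;

/* SoA layout: each field gathered into its own array,
   sized here for one 4-wide SIMD block. */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
} VertexBlock;

/* Swizzle four AoS vertices into SoA form. */
static void aos_to_soa(const Vertex v[4], VertexBlock *out)
{
    for (int i = 0; i < 4; i++) {
        out->x[i] = v[i].x;
        out->y[i] = v[i].y;
        out->z[i] = v[i].z;
    }
}
```

Once in SoA form, a vertical computation touches four x components (or y, or z) with a single packed operation, which is the payoff the chapter's vertical-versus-horizontal discussion describes.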
Chapter 6 Optimizing Cache Usage
291
General Prefetch Coding Guidelines
292
Hardware Prefetching of Data
294
Prefetch and Cacheability Instructions
295
Prefetch
296
Software Data Prefetch
296
The Prefetch Instructions - Pentium 4 Processor Implementation
298
Prefetch and Load Instructions
298
Cacheability Control
299
The Non-Temporal Store Instructions
300
Fencing
300
Streaming Non-Temporal Stores
300
Memory Type and Non-Temporal Stores
301
Write-Combining
302
Streaming Store Usage Models
303
Coherent Requests
303
Non-Coherent Requests
303
Streaming Store Instruction Descriptions
304
The Fence Instructions
305
The SFENCE Instruction
305
The LFENCE Instruction
306
The MFENCE Instruction
306
The CLFLUSH Instruction
307
Memory Optimization Using Prefetch
308
Software-Controlled Prefetch
308
Example 6-1 Pseudo-Code for Using CLFLUSH
308
Hardware Prefetch
309
Example of Effective Latency Reduction with H/W Prefetch
310
Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride
311
Example of Latency Hiding with S/W Prefetch Instruction
312
Figure 6-1 Effective Latency Reduction as a Function of Access Stride
312
Figure 6-2 Memory Access Latency and Execution Without Prefetch
313
Figure 6-3 Memory Access Latency and Execution with Prefetch
313
Software Prefetching Usage Checklist
314
Software Prefetch Scheduling Distance
315
Software Prefetch Concatenation
316
Example 6-3 Prefetch Scheduling Distance
316
Example 6-4 Using Prefetch Concatenation
318
Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop
318
Minimize Number of Software Prefetches
319
Figure 6-4 Prefetch and Loop Unrolling
319
Figure 6-5 Memory Access Latency and Execution with Prefetch
321
Mix Software Prefetch with Computation Instructions
322
Example 6-6 Spread Prefetch Instructions
323
Software Prefetch and Cache Blocking Techniques
324
Figure 6-6 Cache Blocking - Temporally Adjacent and Non-Adjacent Passes
325
Figure 6-7 Examples of Prefetch and Strip-Mining for Temporally Adjacent and Non-Adjacent Passes Loops
326
Example 6-7 Data Access of a 3D Geometry Engine Without Strip-Mining
327
Example 6-8 Data Access of a 3D Geometry Engine with Strip-Mining
328
Hardware Prefetching and Cache Blocking Techniques
329
Table 6-1 Software Prefetching Considerations into Strip-Mining Code
329
Example 6-9 Using HW Prefetch to Improve Read-Once Memory Traffic
330
Single-Pass Versus Multi-Pass Execution
331
Figure 6-8 Single-Pass Vs. Multi-Pass 3D Geometry Engines
332
Memory Optimization Using Non-Temporal Stores
333
Non-Temporal Stores and Software Write-Combining
333
Cache Management
334
Video Decoder
335
Video Encoder
335
Conclusions from Video Encoder and Decoder Implementation
336
Optimizing Memory Copy Routines
336
Example 6-10 Basic Algorithm of a Simple Memory Copy
336
TLB Priming
337
Using the 8-Byte Streaming Stores and Software Prefetch
338
Example 6-11 a Memory Copy Routine Using Software Prefetch
338
Using 16-Byte Streaming Stores and Hardware Prefetch
340
Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation
340
Performance Comparisons of Memory Copy Routines
342
Relative Performance of Memory Copy Routines
342
Table
342
Deterministic Cache Parameters
343
Table 6-3 Deterministic Cache Parameters Leaf
344
Cache Sharing in Single-Core or Multi-Core
345
Cache Sharing Using Deterministic Cache Parameters
345
Determine Prefetch Stride Using Deterministic Cache Parameters
346
Chapter 7 Multi-Core and Hyper-Threading Technology
347
Performance and Usage Models
348
Multithreading
348
Figure 7-1 Amdahl's Law and MP Speed-Up
349
Multitasking Environment
350
Programming Models and Multithreading
352
Parallel Programming Models
353
Domain Decomposition
353
Functional Decomposition
354
Specialized Programming Models
354
Example 7-1 Serial Execution of Producer and Consumer Work Items
355
Figure 7-2 Single-Threaded Execution of Producer-Consumer Threading Model
355
Figure 7-3 Execution of Producer-Consumer Threading Model on a Multi-Core Processor
356
Producer-Consumer Threading Models
356
Example 7-2 Basic Structure of Implementing Producer Consumer Threads
357
Figure 7-4 Interlaced Variation of the Producer Consumer Model
358
Example 7-3 Thread Function for an Interlaced Producer Consumer Model
359
Tools for Creating Multithreaded Applications
360
Optimization Guidelines
362
Key Practices of Thread Synchronization
362
Key Practices of System Bus Optimization
363
Key Practices of Memory Optimization
363
Key Practices of Front-End Optimization
364
Key Practices of Execution Resource Optimization
364
Generality and Performance Impact
365
Thread Synchronization
365
Choice of Synchronization Primitives
366
Table 7-1 Properties of Synchronization Objects
367
Synchronization for Short Periods
368
Example 7-4 Spin-Wait Loop and PAUSE Instructions
370
Optimization with Spin-Locks
371
Synchronization for Longer Periods
372
Avoid Coding Pitfalls in Thread Synchronization
374
Example 7-5 Coding Pitfall Using Spin Wait Loop
375
Prevent Sharing of Modified Data and False-Sharing
376
Placement of Shared Synchronization Variable
377
Example 7-6 Placement of Synchronization and Regular Variables
378
Example 7-7 Declaring Synchronization Variables Without Sharing a Cache Line
378
System Bus Optimization
379
Conserve Bus Bandwidth
380
Understand the Bus and Cache Interactions
381
Avoid Excessive Software Prefetches
382
Improve Effective Latency of Cache Misses
382
Use Full Write Transactions to Achieve Higher Data Rate
383
Memory Optimization
384
Cache Blocking Technique
384
Shared-Memory Optimization
385
Minimize Sharing of Data between Physical Processors
385
Batched Producer-Consumer Model
386
Figure 7-5 Batched Approach of Producer Consumer Model
386
Example 7-8 Batched Implementation of the Producer Consumer Threads
387
Eliminate 64-Kbyte Aliased Data Accesses
388
Preventing Excessive Evictions in First-Level Data Cache
389
Per-Thread Stack Offset
390
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads
391
Per-Instance Stack Offset
392
Example 7-10 Adding a Pseudo-Random Offset to the Stack Pointer in the Entry Function
393
Front-End Optimization
394
Avoid Excessive Loop Unrolling
394
Optimization for Code Size
395
Using Thread Affinities to Manage Shared Platform Resources
395
Example 7-11 Assembling 3-Level IDs, Affinity Masks for Each Logical Processor
397
Example 7-12 Assembling a Lookup Table to Manage Affinity Masks and Schedule Threads to Each Core First
400
Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache
401
Using Shared Execution Resources in a Processor Core
405
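Chapter 7's synchronization entries (Example 7-4 in particular) recommend a PAUSE-based spin-wait loop for short waiting periods. A hedged C sketch of that loop; `_mm_pause` and `<immintrin.h>` are compiler intrinsics assumed available on x86 toolchains, and on other targets the relax step here degrades to a no-op:

```c
#include <stdatomic.h>

#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()    /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)      /* portable fallback */
#endif

/* Spin until *flag becomes nonzero, inserting PAUSE between polls
   to reduce memory-order violations and power on the spinning
   logical processor; returns the observed value. */
static int spin_wait(const atomic_int *flag)
{
    int v;
    while ((v = atomic_load_explicit(flag, memory_order_acquire)) == 0)
        cpu_relax();
    return v;
}
```

As the chapter's longer-period guidance notes, a loop like this is only appropriate for very short waits; past a few thousand iterations the waiting thread should yield to the operating system instead.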
Chapter 8 64-Bit Mode Coding Guidelines
409
Introduction
409
Coding Rules Affecting 64-Bit Mode
409
Use Legacy 32-Bit Instructions When the Data Size Is 32 Bits
409
Use Extra Registers to Reduce Register Pressure
410
Use 64-Bit by 64-Bit Multiplies that Produce 128-Bit Results Only When Necessary
410
Sign Extension to Full 64-Bits
411
Alternate Coding Rules for 64-Bit Mode
412
Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic
412
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible
414
Using Software Prefetch
414
Chapter 9 Power Optimization for Mobile Usages
415
Overview
415
Mobile Usage Scenarios
416
Figure 9-1 Performance History and State Transitions
417
ACPI C-States
418
Figure 9-2 Active Time Versus Halted Time of a Processor
418
Figure 9-3 Application of C-States to Idle Time
420
Processor-Specific C4 and Deep C4 States
420
Guidelines for Extending Battery Life
421
Adjust Performance to Meet Quality of Features
422
Reducing Amount of Work
423
Platform-Level Optimizations
424
Handling Sleep State Transitions
425
Using Enhanced Intel SpeedStep® Technology
426
Figure 9-4 Profiles of Coarse Task Scheduling and Power Consumption
426
Enabling Intel ® Enhanced Deeper Sleep
428
Multi-Core Considerations
429
Enhanced Intel SpeedStep® Technology
429
Thread Migration Considerations
430
Figure 9-5 Thread Migration in a Multi-Core Processor
431
Multi-Core Considerations for C-States
431
Figure 9-6 Progression to Deeper Sleep
432
Appendix A Application Performance Tools
435
Intel ® Compilers
436
Code Optimization Options
437
Targeting a Processor (-Gn)
437
Automatic Processor Dispatch Support (-Qx[Extensions] and -Qax[Extensions])
438
Vectorizer Switch Options
439
Loop Unrolling
439
Multithreading with OpenMP
440
Inline Expansion of Library Functions (-Oi, -Oi-)
440
Floating-Point Arithmetic Precision
440
-Qlong_double
440
Rounding Control Option (-Qrcd)
440
Interprocedural and Profile-Guided Optimizations
441
Interprocedural Optimization (IPO)
441
Profile-Guided Optimization (PGO)
441
Intel® VTune™ Performance Analyzer
442
Sampling
443
Time-Based Sampling
443
Event-Based Sampling
444
Figure A-1 Sampling Analysis of Hotspots by Location
444
Workload Characterization
445
Call Graph
447
Counter Monitor
448
Intel ® Tuning Assistant
448
Intel ® Performance Libraries
448
Benefits Summary
449
Optimizations with the Intel ® Performance Libraries
450
Enhanced Debugger (EDB)
451
Intel ® Threading Tools
451
Intel ® Thread Checker
451
Figure A-2 Intel Thread Checker Can Locate Data Race Conditions
452
Thread Profiler
453
Intel ® Software College
454
Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded Execution Timelines
454
Appendix B Using Performance Monitoring Events
455
Pentium 4 Processor Performance Metrics
455
Pentium 4 Processor-Specific Terminology
456
Bogus, Non-Bogus, Retire
456
Bus Ratio
456
Replay
457
Assist
457
Tagging
457
Counting Clocks
458
Non-Halted Clockticks
459
Non-Sleep Clockticks
460
Time Stamp Counter
461
Microarchitecture Notes
462
Trace Cache Events
462
Bus and Memory Metrics
462
Figure B-1 Relationships between the Cache Hierarchy, IOQ, BSQ and Front Side Bus
464
Reads Due to Program Loads
465
Reads Due to Program Writes (RFOs)
465
Writebacks (Dirty Evictions)
466
Usage Notes for Specific Metrics
467
Usage Notes on Bus Activities
469
Metrics Descriptions and Categories
470
Table B-1 Pentium 4 Processor Performance Metrics
472
Performance Metrics and Tagging Mechanisms
500
Tags for Replay_Event
500
Table B-2 Metrics that Utilize Replay Tagging Mechanism
501
Tags for Front_End_Event
502
Tags for Execution_Event
502
Table B-3 Metrics that Utilize the Front-End Tagging Mechanism
502
Table B-4 Metrics that Utilize the Execution Tagging Mechanism
503
Using Performance Metrics with Hyper-Threading Technology
504
Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3)
504
Table B-6 Metrics that Support Qualification by Logical Processor and Parallel Counting
505
Table B-7 Metrics that Are Independent of Logical Processors
509
Using Performance Events of Intel Core Solo and Intel Core Duo Processors
510
Understanding the Results in a Performance Counter
510
Ratio Interpretation
511
Notes on Selected Events
512
Appendix C IA-32 Instruction Latency and Throughput
515
Overview
516
Definitions
518
Latency and Throughput
518
Latency and Throughput with Register Operands
520
Table C-1 Streaming SIMD Extension 3 SIMD Floating-Point Instructions
520
Table C-2 Streaming SIMD Extension 2 128-Bit Integer Instructions
521
Table C-3 Streaming SIMD Extension 2 Double-Precision Floating-Point Instructions
523
Table C-4 Streaming SIMD Extension Single-Precision Floating-Point Instructions
526
Table C-6 MMX Technology 64-Bit Instructions
528
Table Footnotes
533
Latency and Throughput with Memory Operands
534
Appendix D Stack Alignment
537
Stack Frames
537
Figure D-1 Stack Frames Based on Alignment Type
539
Aligned ESP-Based Stack Frames
540
Example D-1 Aligned ESP-Based Stack Frames
541
Aligned EBP-Based Stack Frames
542
Example D-2 Aligned EBP-Based Stack Frames
543
Stack Frame Optimizations
545
Inlined Assembly and EBX
546
Appendix E Mathematics of Prefetch Scheduling Distance
547
Simplified Equation
547
Mathematical Model for PSD
548
Example E-1 Calculating Insertion for Scheduling Distance of 3
549
Figure E-1 Pentium II, Pentium III and Pentium 4 Processors Memory Pipeline Sketch
550
Figure E-2 Execution Pipeline, no Preloading or Prefetch
552
No Preloading or Prefetch
552
Figure E-3 Compute Bound Execution Pipeline
553
Compute Bound (Case: Tl + Tb > Tc > Tb)
554
Compute Bound (Case: Tc >= Tl + Tb)
554
Figure E-4 Another Compute Bound Execution Pipeline
554
Figure E-5 Memory Throughput Bound Pipeline
556
Memory Throughput Bound (Case: Tb >= Tc)
556
Example
557
Figure E-6 Accesses Per Iteration, Example 1
558
Figure E-7 Accesses Per Iteration, Example 2
559