Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3 Manual


Summary of Contents for Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS MANUAL VOLUME 1 REV 2.3

  • Page 2 Intel® Itanium® Architecture Software Developer’s Manual Volume 1: Application Architecture Revision 2.3 May 2010 Document Number: 245317...
  • Page 3 Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.
  • Page 4: Table Of Contents

    Part 1: Application Architecture Guide (page 1:3). 1.1.2 Part 2: Optimization Guide for the Intel® Itanium® Architecture (page 1:3). Overview of Volume 2: System Architecture.
  • Page 5 Floating-point Interruptions (page 1:101). Intel® Itanium® Architecture Software Developer’s Manual, Rev. 2.3...
  • Page 6 Additions beyond the IEEE Standard (page 1:107). IA-32 Application Execution Model in an Intel® Itanium® System Environment (page 1:109). IA-32 Execution Layer.
  • Page 7 Software Pipelining (page 1:183). Loop Support Features in the Intel® Itanium® Architecture (page 1:184). 5.4.1...
  • Page 8 IA-32 Application Register Model (page 1:114).
  • Page 9 Memory Addressing Model (page 1:131). Part II: Optimization Guide for the Intel® Itanium® Architecture: Control Dependency Preventing Code Motion.
  • Page 10 IA-32 Floating-point Status Register Mapping (FSR) (page 1:127). Part II: Optimization Guide for the Intel® Itanium® Architecture: ctop Loop Trace (page 1:188). wtop Loop Trace.
  • Page 12 Part I: Application Architecture Guide
  • Page 14: Overview Of Volume 1: Application Architecture

    IA-32 application interface. This volume also describes optimization techniques used to generate high performance software. 1.1.1 Part 1: Application Architecture Guide Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. Intel...
  • Page 15: Predication, Control Flow, And Instruction Stream

    1.2.1 Part 1: System Architecture Guide Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual....
  • Page 16: Part 2: System Programmer's Guide

    Chapter 9, “IA-32 Interruption Vector Descriptions” lists IA-32 exceptions, interrupts and intercepts that can occur during IA-32 instruction set execution in the Itanium System Environment. Chapter 10, “Itanium® Architecture-based Operating System Interaction Model with IA-32 Applications” defines the operation of IA-32 instructions within the Itanium System Environment from the perspective of an Itanium architecture-based operating system.
  • Page 17: Appendices

    Instruction Set Reference This volume is a comprehensive reference to the Itanium instruction set, including instruction format/encoding. Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. Chapter 2, “Instruction Reference”...
  • Page 18: Terminology

    These resources include instructions and registers. Itanium Architecture – The new ISA with 64-bit instruction capabilities, new performance-enhancing features, and support for the IA-32 instruction set. IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
  • Page 19: Revision History

    • Intel® 64 and IA-32 Architectures Software Developer’s Manual – This set of manuals describes the Intel 32-bit architecture. They are available from the Intel Literature Department by calling 1-800-548-4725 and requesting Document Numbers 243190, 243191 and 243192....
  • Page 20 Revision History (continued). August 2005: Allow register fields in CR.LID register to be read-only and CR.LID checking on interruption messages by processors optional. See Vol 2, Part I, Ch 5 “Interruptions” and Section 11.2.2 PALE_RESET Exit State for details. Relaxed reserved and ignored fields checking in IA-32 application registers in Vol 1 Ch 6 and Vol 2, Part I, Ch 10.
  • Page 21 Revision History (continued). August 2002: Added Predicate Behavior of alloc Instruction Clarification (Section 4.1.2, Part I, Volume 1; Section 2.2, Part I, Volume 3). Added New fc.i Instruction (Section 4.4.6.1, and 4.4.6.2, Part I, Volume 1; Section 4.3.3, 4.4.1, 4.4.5, 4.4.6, 4.4.7, 5.5.2, and 7.1.2, Part I, Volume 2; Section 2.5, 2.5.1, 2.5.2, 2.5.3, and 4.5.2.1, Part II, Volume 2;...
  • Page 22 Revision History (continued). Volume 2: Class pr-writers-int clarification (Table A-5). PAL_MC_DRAIN clarification (Section 4.4.6.1). VHPT walk and forward progress change (Section 4.1.1.2). IA-32 IBR/DBR match clarification (Section 7.1.1). ISR figure changes (pp. 8-5, 8-26, 8-33 and 8-36). PAL_CACHE_FLUSH return argument change –...
  • Page 23 Revision History (continued). Volume 2: Clarifications regarding “reserved” fields in ITIR (Chapter 3). Instruction and Data translation must be enabled for executing IA-32 instructions (Chapters 3, 4 and 10). FCR/FDR mappings, and clarification to the value of PSR.ri after an RFI (Chapters 3 and 4).
  • Page 24: Introduction To The Intel ® Itanium ® Architecture

    Operating Environments The architectural model supports a mixture of IA-32 and Itanium architecture-based applications within a single Itanium architecture-based operating system. Table 2-1 defines the major supported operating environments. Volume 1, Part 1: Introduction to the Intel® Itanium® Architecture 1:13...
  • Page 25: Instruction Set Transition Model

    Table 2-1. Major Operating Environments
    System Environment | Application Environment | Usage
    Itanium System Environment | IA-32 Protected Mode | IA-32 Protected Mode applications in the Intel® Itanium® System Environment.
    Itanium System Environment | IA-32 Real Mode | IA-32 Real Mode applications in the Intel® Itanium® System Environment....
  • Page 26: Instruction Set Features

    (see “Speculation” on page 1:16). In traditional architectures, procedure calls limit performance since registers need to be spilled and...
  • Page 27: Compiler To Processor Communication

    If the new control speculative load causes an exception, then the exception should only be serviced if (a>b) is true. When...
  • Page 28: Data Speculation

    To illustrate, an unpredicated instruction r1 = r2 + r3 when predicated, would be of the form...
  • Page 29: Register Stack

    The hardware can exploit the explicit register stack frame information to spill and fill registers from the register stack to memory at the best opportunity (independent of the calling and called procedures)....
  • Page 30: Branching

    128 floating-point registers are defined. Of these, 96 registers are rotating (not stacked) and can be used to modulo schedule loops compactly. Multiple floating-point status registers are provided for speculation....
  • Page 31: Multimedia Support

    They are useful for creating high performance compression/decompression algorithms that are used by applications which have sound and video. Itanium multimedia instructions are semantically compatible with HP’s MAX-2* multimedia technology and Intel’s MMX and SSE technology instructions....
  • Page 32: System Performance And Scalability

    The following terms are used in the remainder of this document: • Itanium Instruction Set – The Itanium architecture defines the 64-bit instruction set extensions to the IA-32 architecture. • IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
  • Page 34: Application Register State

    Execution Environment The architectural state consists of registers and memory. The results of instruction execution become architecturally visible according to a set of execution sequencing rules. This chapter describes the application architectural state and the rules for execution sequencing. See Chapter 6 for details on IA-32 instruction set execution.
  • Page 35: Reserved And Ignored Registers And Fields

    ignore the value written. In variable-sized register sets, registers which are unimplemented in a particular processor are also reserved registers. An access to one of these unimplemented registers causes a Reserved Register/Field fault. Within defined registers, fields which are not defined are either reserved or ignored. For reserved fields, hardware will always return a zero on a read.
  • Page 36: General Registers

    Figure 3-1. Application Register Model (figure: general registers with NaT bits, floating-point registers including +0.0 and +1.0, predicates, branch registers, application registers such as BSPSTORE, RNAT, EFLAG, CFLG, UNAT and FPSR, instruction pointer, current frame marker, user mask, advanced load address table, performance monitor data registers, and processor identifiers (cpuid))...
  • Page 37: Floating-Point Registers

    General registers 8 through 31 contain the IA-32 integer, segment selector and segment descriptor registers. See “IA-32 General Purpose Registers” on page 1:117 for details on IA-32 register assignments. 3.1.3 Floating-point Registers A set of 128 (82-bit) floating-point registers is used for all floating-point computation.
  • Page 38: Instruction Pointer

    3.1.6 Instruction Pointer The Instruction Pointer (IP) holds the address of the bundle which contains the current executing instruction. The IP can be read directly with a mov ip instruction. The IP cannot be directly written, but is incremented as instructions are executed, and can be set to a new value with a branch.
  • Page 39: Application Registers

    3.1.8 Application Registers The application register file includes special-purpose data registers and control registers for application-visible processor functions for both the IA-32 and Itanium instruction set architectures. These registers can be accessed by Itanium architecture-based applications (except where noted). Table 3-3 contains a list of the application registers.
  • Page 40: Rsc Format

    Application registers can only be accessed by either an M or I execution unit. This is specified in the last column of the table. The ignored registers are for future backward-compatible extensions. See Section 10.2, “System Register Model” on page 2:239 for the field definition of each IA-32 application register.
  • Page 41: Bsp Register Format

    Figure 3-4. BSP Register Format (field: pointer). 3.1.8.4 RSE Backing Store Pointer for Memory Stores (BSPSTORE – AR 18) The RSE Backing Store Pointer for memory stores is a 64-bit register (Figure 3-5). It holds the address of the location in memory to which the RSE will spill the next value. Section 6.1, “RSE and Backing Store Overview”...
  • Page 42 3.1.8.8 User NaT Collection Register (UNAT – AR 36) The User NaT Collection Register is a 64-bit register used to temporarily hold NaT bits when saving and restoring general registers with the ld8.fill and st8.spill instructions. 3.1.8.9 Floating-point Status Register (FPSR – AR 40) The floating-point status register (FPSR) controls traps, rounding mode, precision control, flags, and other control bits for Itanium floating-point instructions.
  • Page 43: Pfs Format

    System software can secure the resource utilization counter from non-privileged access. When secured, a read of the RUC at any privilege level other than the most privileged causes a Privileged Register fault. The RUC for a logical processor does not count when that logical processor is in LIGHT_HALT, unless all logical processors on a given physical processor are in LIGHT_HALT, in which case the last logical processor on a given physical processor to enter LIGHT_HALT has its RUC continue to count.
  • Page 44: Performance Monitor Data Registers (Pmd)

    3.1.8.13 Loop Count Register (LC – AR 65) The Loop Count register (LC) is a 64-bit register used in counted loops. LC is decremented by counted-loop-type branches. 3.1.8.14 Epilog Count Register (EC – AR 66) The Epilog Count register (EC) is a 6-bit register used for counting the final (epilog) stages in modulo-scheduled loops.
  • Page 45: Processor Identification Registers

    0: unaligned data memory references may cause an Unaligned Data Reference fault. 1: all unaligned data memory references cause an Unaligned Data Reference fault. Lower (f2..f31) floating-point registers written – This bit is set to one when an Intel® Itanium® instruction that uses register f2..f31 as a target register completes.
  • Page 46: Cpuid Register 4 - General Features/Capability Bits

    Table 3-7. CPUID Register 3 Fields
    Field | Bits | Description
    number | | The index of the largest implemented CPUID register (one less than the number of implemented CPUID registers). This value will be at least 4.
    revision | 15:8 | Processor revision number. An 8-bit value that represents the revision or stepping of this processor implementation within the processor model.
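As a rough illustration of how software might read these fields, the Python sketch below pulls bit ranges out of a 64-bit CPUID register 3 value. Only the revision range (15:8) comes from the table text; placing `number` in bits 7:0 is an assumption, and the helper names and the sample value are invented for the example.

```python
def bit_field(value: int, high: int, low: int) -> int:
    """Return bits high..low (inclusive) of a register value."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_cpuid3(value: int) -> dict:
    """Decode two CPUID register 3 fields from a raw 64-bit value."""
    return {
        "number": bit_field(value, 7, 0),     # assumed position, not in the table text
        "revision": bit_field(value, 15, 8),  # bits 15:8 per Table 3-7
    }


sample = 0x0704  # hypothetical register value, not taken from any real processor
fields = decode_cpuid3(sample)

# "number" is the index of the largest implemented CPUID register,
# so the count of implemented registers is one more than that.
implemented_registers = fields["number"] + 1
```

With the made-up sample, `number` decodes to 4, which matches the stated minimum for this field.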
  • Page 47: Memory

    Table 3-8. CPUID Register 4 Fields (Continued) (Field, Bits, Description). Processor implements mpy4 and mpyshl4 instructions (see “tf — Test Feature” instruction in Volume 3). Bits 63:34: Reserved. Memory This section describes an Itanium architecture-based application program’s view of memory. This includes a description of how memory is accessed, for both 32-bit and 64-bit applications.
  • Page 48: Little-Endian Loads

    larger-than-byte loads and stores are big endian (lower-addressed bytes in memory correspond to the higher-order bytes in the register). Load byte and store byte are not affected by the UM.be bit. The UM.be bit does not affect instruction fetch, IA-32 references, or the RSE.
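The byte-order rule can be sketched in a few lines of Python. This is only a model of the register value a load would produce, assuming that a cleared UM.be bit selects little-endian order; the `load` function and the memory image are hypothetical.

```python
def load(memory: bytes, address: int, size: int, um_be: int) -> int:
    """Model a load of `size` bytes starting at `address`.

    um_be stands in for the UM.be bit: 1 selects big-endian,
    0 selects little-endian (assumed). Single-byte loads are unaffected.
    """
    raw = memory[address:address + size]
    byte_order = "big" if um_be else "little"
    return int.from_bytes(raw, byte_order)


# Hypothetical memory image: address 0 holds 0x11, address 1 holds 0x22, ...
memory = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])

big = load(memory, 0, 4, um_be=1)      # lower address -> higher-order byte
little = load(memory, 0, 4, um_be=0)   # lower address -> lower-order byte
```

A one-byte load returns the same value for either setting, which mirrors the statement that load byte and store byte are not affected by UM.be.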
  • Page 49: Bundle Format

    Instruction Encoding Overview Each instruction is categorized into one of six types; each instruction type may be executed on one or more execution unit types. Table 3-9 lists the instruction types and the execution unit type on which they are executed. Table 3-9.
  • Page 50 Table 3-10. Template Field Encoding and Instruction Slot Mapping
    Template | Slot 0 | Slot 1 | Slot 2
     | M-unit | M-unit | I-unit
     | M-unit | F-unit | I-unit
     | M-unit | F-unit | I-unit
     | M-unit | M-unit | F-unit
     | M-unit | M-unit | F-unit
     | M-unit | I-unit | B-unit
     | M-unit | I-unit | B-unit
     | M-unit | B-unit | B-unit
     | M-unit | B-unit | B-unit...
  • Page 51 4. Update architectural state, if necessary (update). An instruction group is a sequence of instructions starting at a given bundle address and slot number and including all instructions at sequentially increasing slot numbers and bundle addresses up to the first stop, taken branch, Break Instruction fault due to a break.b, or Illegal Operation fault due to a Reserved or Reserved if PR[qp] is one encoding in the B-type opcode space.
  • Page 52 The ordering rules above form the context for register dependency restrictions, memory dependency restrictions and the order of exception reporting. These dependency restrictions apply only between instructions whose resource reads and writes are not dynamically disabled by predication. • Register dependencies: Within an instruction group, read-after-write (RAW) and write-after-write (WAW) register dependencies are not allowed (except as noted in “RAW Dependency Special Cases”...
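To make the register dependency rule concrete, here is a small Python sketch of a checker that scans one instruction group and reports RAW and WAW register dependencies, skipping instructions whose qualifying predicate is 0. The `Instr` record and register names are invented for the example, and the special cases the manual lists as exceptions are deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    reads: frozenset
    writes: frozenset
    enabled: bool = True  # False models a qualifying predicate of 0


def find_violations(group):
    """Return (kind, register, earlier, later) tuples for one instruction group."""
    violations = []
    written = {}  # register -> name of the earlier writer in this group
    for ins in group:
        if not ins.enabled:
            continue  # reads and writes disabled by predication do not count
        for reg in sorted(ins.reads):
            if reg in written:
                violations.append(("RAW", reg, written[reg], ins.name))
        for reg in sorted(ins.writes):
            if reg in written:
                violations.append(("WAW", reg, written[reg], ins.name))
        for reg in ins.writes:
            written[reg] = ins.name
    return violations


raw_group = [
    Instr("add", frozenset({"r2", "r3"}), frozenset({"r1"})),
    Instr("sub", frozenset({"r1"}), frozenset({"r4"})),   # reads r1 written above
]
war_group = [
    Instr("ld", frozenset({"r1"}), frozenset({"r2"})),
    Instr("mov", frozenset({"r3"}), frozenset({"r1"})),   # write after read only
]
```

In this sketch `raw_group` is flagged and `war_group` is not, because only earlier writes within the group are tracked. A compiler or assembler would resolve the flagged case by placing a stop between the two instructions.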
  • Page 53 The ordering rules and the dependency restrictions allow the processor to dynamically re-order instructions, execute instructions with non-unit latency, or even concurrently execute instructions on opposing sides of a stop or taken branch, provided that correct sequencing is enforced and the appearance of sequential execution is presented to the programmer.
  • Page 54 br.ia work like other instructions for the purposes of register dependency; i.e., if their qualifying predicate is 0, they are not considered readers or writers of other resources. Branches br.cloop, br.cexit, br.ctop, br.wexit, and br.wtop are exceptional in that they are always readers or writers of their resources, regardless of the value of their qualifying predicate.
  • Page 55 3.4.3 WAR Dependency Special Cases The WAR dependency between the reading of predicate register 63 by any B-type instruction and the subsequent writing of predicate register 63 by a modulo-scheduled loop type branch (br.ctop, br.cexit, br.wtop, or br.wexit) without an intervening stop is not allowed.
  • Page 56 • RAW and WAW register dependencies within the same instruction group are disallowed except as noted in Section 3.4, “Instruction Sequencing Considerations” on page 1:39. Their behavior within an instruction group is undefined. Undefined behavior includes the possibility of an Illegal Operation fault. •...
  • Page 58: Application Programming Model

    64 bits before use. The floating-point programming model is described separately in Chapter 5, “Floating-point Programming Model” in Volume 1. Refer to Volume 3: Intel® Itanium® Instruction Set Reference for detailed information on Itanium instructions. The main features of the programming model covered here are: •...
  • Page 59 The local and output areas of a frame can be re-sized using the alloc instruction which specifies immediates that determine the size of frame (sof) and size of locals (sol). Note: In the assembly language, alloc uses three immediate operands to determine the values of sol and sof: the size of inputs;...
  • Page 60: Register Stack Behavior On Procedure Call And Return

    Figure 4-1. Register Stack Behavior on Procedure Call and Return (figure: stacked GRs and frame markers (sol, sof) for the caller’s frame (procA) and the callee’s frame (procB) across call, alloc, and return)...
  • Page 61: Architectural Visible State Related To The Register Stack

    The flushrs instruction is used to force all previous stack frames out to backing store memory. It stalls instruction execution until all active frames in the physical register stack up to, but not including the current frame are spilled to the backing store by the RSE.
  • Page 62: Integer Arithmetic Instructions

    4.2.1 Arithmetic Instructions Addition and subtraction (add, sub) are supported with regular two input forms and special three input forms. The three input addition form adds one to the sum of two input registers. The three input subtraction form subtracts one from the difference of two input registers.
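The two-input and three-input forms can be modeled with plain integer arithmetic. The sketch below is illustrative only: the function names and the `plus_one`/`minus_one` flags are made up, and results are wrapped to 64 bits on the assumption that general registers hold 64-bit values.

```python
MASK64 = (1 << 64) - 1


def add(a: int, b: int, plus_one: bool = False) -> int:
    """Two-input add, or the three-input form that adds one to the sum."""
    return (a + b + (1 if plus_one else 0)) & MASK64


def sub(a: int, b: int, minus_one: bool = False) -> int:
    """Two-input subtract, or the three-input form that subtracts one more."""
    return (a - b - (1 if minus_one else 0)) & MASK64


three_input_sum = add(2, 3, plus_one=True)      # 2 + 3 + 1
three_input_diff = sub(5, 3, minus_one=True)    # 5 - 3 - 1
```

The extra one lets a single instruction absorb the carry-in or borrow that arises in multi-word arithmetic and in two's complement negation.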
  • Page 63: Integer Logical Instructions

    Table 4-4. Integer Logical Instructions (Mnemonic, Operation): logical and; logical or; logical and complement (andcm); logical exclusive or. 4.2.3 32-bit Addresses and Integers Support for 32-bit addresses is provided in the form of add instructions that perform region bit copying. This supports the virtual address translation model (see “32-bit Virtual Addressing”...
  • Page 64: Instructions To Generate Large Constants

    position of the field are specified by two immediates. This is essentially a shift-right-and-mask operation. A simple right shift by a fixed amount can be specified by using shr with an immediate value for the shift amount. This is just an assembly pseudo-op for an extract instruction where the field to be extracted extends all the way to the left-most register bit.
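The shift-right-and-mask view of extract, and the way a fixed right shift falls out of it, can be sketched as follows. Only the unsigned case is shown; the function names are illustrative and do not correspond to assembler syntax.

```python
MASK64 = (1 << 64) - 1


def extract_unsigned(value: int, pos: int, length: int) -> int:
    """Shift right by `pos`, then keep the low `length` bits."""
    return ((value & MASK64) >> pos) & ((1 << length) - 1)


def shr_unsigned(value: int, count: int) -> int:
    """Right shift by a fixed amount, expressed as an extract whose
    field runs from bit `count` up to the left-most register bit."""
    return extract_unsigned(value, count, 64 - count)


middle_byte = extract_unsigned(0xABCD, 4, 8)          # bits 11:4 of 0xABCD
shifted = shr_unsigned(0x8000000000000001, 4)
```

Because the extracted field reaches bit 63, the mask removes nothing and the result equals an ordinary logical right shift.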
  • Page 65: Compare Instructions

    Compare Instructions and Predication A set of compare instructions provides the ability to test for various conditions and affect the dynamic execution of instructions. A compare instruction tests for a single specified condition and generates a boolean result. These results are written to predicate registers.
  • Page 66: Compare Type Function

    The 64-bit (cmp) and 32-bit (cmp4) compare instructions compare two registers, or a register and an immediate, for one of ten relations (e.g., >, <=). The compare instructions set two predicate targets according to the result. The cmp4 instruction compares the least-significant 32-bits of both sources (the most significant 32-bits are ignored).
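The difference between cmp and cmp4 shows up when the upper halves of the sources differ. The Python sketch below models only the equality relation and assumes the Normal compare type writes the result to the first predicate target and its complement to the second; the function names are invented.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def cmp_eq(a: int, b: int):
    """64-bit equality compare; returns the two predicate targets."""
    result = int((a & MASK64) == (b & MASK64))
    return result, 1 - result  # second target assumed to be the complement


def cmp4_eq(a: int, b: int):
    """32-bit equality compare; the upper 32 bits of both sources are ignored."""
    result = int((a & MASK32) == (b & MASK32))
    return result, 1 - result


a = 0x0000000100000005  # differs from b only above bit 31
b = 0x0000000000000005

full_width = cmp_eq(a, b)
low_half = cmp4_eq(a, b)
```

The two sources compare unequal at 64 bits but equal at 32 bits, since cmp4 looks only at the least-significant halves.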
  • Page 67: Compare Outcome With Nat Source Input

    The Unconditional compare type behaves the same as the Normal type, except that if the qualifying predicate is 0, both predicate targets are written with 0. This can be thought of as an initialization of the predicate targets, combined with a Normal compare.
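A minimal sketch of the Normal and Unconditional compare types, under the assumption that a Normal compare with a qualifying predicate of 0 leaves both targets unchanged (ordinary predication). The `compare` function and its string type names are hypothetical.

```python
def compare(ctype: str, qp: int, result: int, p1: int, p2: int):
    """Return the new values of the two predicate targets.

    ctype  -- "normal" or "unc" (names chosen for this sketch)
    qp     -- qualifying predicate value (0 or 1)
    result -- outcome of the tested relation (0 or 1)
    p1, p2 -- current values of the two predicate targets
    """
    if qp:
        return result, 1 - result   # same behavior for both types
    if ctype == "unc":
        return 0, 0                 # both targets written with 0
    return p1, p2                   # normal: targets left unchanged (assumed)


# Targets start out holding stale values from earlier code.
stale = (1, 1)
after_normal = compare("normal", 0, 1, *stale)
after_unc = compare("unc", 0, 1, *stale)
```

Clearing both targets when the qualifying predicate is 0 is what lets the Unconditional type double as an initialization of the predicates, which is useful for nested conditions.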
  • Page 68: Memory Access Instructions

    4.3.4 Predicate Register Transfers Instructions are provided to transfer between the predicate register file and a general register. These instructions operate in a “broadside” manner whereby multiple predicate registers are transferred in parallel, such that predicate register N is transferred to/from bit N of a general register.
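The broadside mapping between predicate register N and bit N of a general register can be sketched as a pair of conversions. The function names are illustrative, a list of 64 booleans stands in for the predicate register file, and any mask operand the real instructions take is not modeled.

```python
def predicates_to_gr(predicates) -> int:
    """Pack 64 predicate values into an integer: predicate N -> bit N."""
    return sum(1 << n for n, p in enumerate(predicates) if p)


def gr_to_predicates(value: int):
    """Unpack bit N of a general register value into predicate N."""
    return [bool((value >> n) & 1) for n in range(64)]


predicates = [False] * 64
for n in (0, 3, 63):
    predicates[n] = True

packed = predicates_to_gr(predicates)
round_trip = gr_to_predicates(packed)
```

Saving and restoring predicates around a call is the typical use: one move captures all 64 bits, and one move puts them back.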
  • Page 69: State Relating To Memory Access

    Load, store and semaphore instructions are summarized in Table 4-12 and the state related to memory reference instructions is summarized in Table 4-13.
    Table 4-12. Memory Access Instructions
    Operation | General | Floating-point Normal | Floating-point Load Pair
    Load | | | ldfp
    Speculative load | ld.s | ldf.s | ldfp.s
    Advanced load...
  • Page 70 The floating-point load pair instructions load two adjacent single precision (4 bytes each), double precision (8 bytes each), or integer/parallel FP (8 bytes each) numbers into two independent floating-point registers (see the ldfp instruction description for restrictions on target register specifiers). Floating-point load pair instructions can specify base register update, but only by an immediate value equal to double the data size.
  • Page 71 Three types of atomic semaphore operations are defined: exchange (xchg); compare and exchange (cmpxchg); and fetch and add (fetchadd). The xchg target is loaded with the zero-extended contents of the memory location addressed by the first source and then the second source is stored into the same memory location.
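The three semaphore operations can be modeled against a dictionary that stands in for memory. This sketch does not model atomicity or ordering, and the cmpxchg and fetchadd behavior follows the conventional meaning of those operations rather than text visible in this excerpt; all names are hypothetical.

```python
def _mask(size: int) -> int:
    return (1 << (8 * size)) - 1


def xchg(memory: dict, address: int, new_value: int, size: int) -> int:
    """Return the zero-extended old contents, then store the new value."""
    old = memory[address] & _mask(size)
    memory[address] = new_value & _mask(size)
    return old


def cmpxchg(memory: dict, address: int, expected: int, new_value: int, size: int) -> int:
    """Store new_value only if memory equals `expected`; return the old contents."""
    old = memory[address] & _mask(size)
    if old == (expected & _mask(size)):
        memory[address] = new_value & _mask(size)
    return old


def fetchadd(memory: dict, address: int, increment: int, size: int) -> int:
    """Add `increment` to memory and return the value that was there before."""
    old = memory[address] & _mask(size)
    memory[address] = (old + increment) & _mask(size)
    return old


memory = {0x1000: 5}
seen_by_xchg = xchg(memory, 0x1000, 9, size=4)            # memory now holds 9
failed = cmpxchg(memory, 0x1000, 5, 1, size=4)            # 9 != 5, no store
succeeded = cmpxchg(memory, 0x1000, 9, 1, size=4)         # 9 == 9, store 1
before_add = fetchadd(memory, 0x1000, 4, size=4)          # memory becomes 5
```

In each case the caller learns the previous memory contents, which is what makes these operations usable as building blocks for locks and counters.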
  • Page 72 indicates that the register contains a deferred exception token, and that its 64-bit data portion contains an implementation-specific value that software cannot rely upon. In floating-point registers, a deferred exception is indicated by a specific pseudo-zero encoding called the NaTVal (see “Representation of Values in Floating-point Registers”...
  • Page 73 For these instructions, if any source contains a deferred exception token, all predicate targets are either cleared or left unchanged, depending on the compare type (see Table 4-10 on page 1:56). Software can use this behavior to ensure that any dependent conditional branches are not taken and any dependent predicated instructions are nullified.
  • Page 74: State Related To Control Speculation

    • The st8.spill may write a zero to the specified memory location, or • The st8.spill may write the register’s 64-bit data portion to memory, only if that implementation returns a zero into the target register of all NaTed speculative loads, and that implementation also guarantees that all NaT propagating instructions perform all computations as specified by the instruction pages.
  • Page 75: Data Speculation Recovery Using Ld

    4.4.5.1 Data Speculation Concepts An ambiguous memory dependency is said to exist between a store (or any operation that may update memory state) and a load when it cannot be statically determined whether the load and store might access overlapping regions of memory. For convenience, a store that cannot be statically disambiguated relative to a particular load is said to be ambiguous relative to that load.
  • Page 76: Data Speculation Recovery Using Chk

    speculation check (chk.s) in that, if the speculation was successful, execution continues inline and no recovery is necessary; if speculation was unsuccessful, the chk.a branches to compiler-generated recovery code. The recovery code contains instructions that will re-execute all the work that was dependent on the failed data speculative load up to the point of the check instruction.
  • Page 77 3. A new entry is allocated in the ALAT which contains the new ALAT register tag, the load access size, and a tag derived from the physical memory address. The insertion of the new ALAT entry must occur no later in visibility order than the load of the data.
  • Page 78 than the load of the data. If the check load was an ordered check load (ld.c.clr.acq), then it is performed with the semantics of an ordered load (ld.acq). ALAT register tag lookups by advanced load checks and check loads are subject to memory ordering constraints as outlined in “Memory Access Ordering”...
  • Page 79 3. Software accesses the RSE backing store with advanced loads. See Section 6.9, “RSE and ALAT Interaction” on page 2:146 (since RSE stores do not invalidate ALAT entries). 4. Software explicitly changes the virtual to physical register mapping on stacked registers by switching the RSE backing stores.
  • Page 80: Memory Hierarchy

    moved out of the loop by the compiler. This behavior ensures that if the check load fails on one iteration, then the check load will not necessarily fail on all subsequent iterations. Whenever a new entry is inserted into the ALAT or when the contents of an entry are updated, the information written into the ALAT only uses information from the check load and does not use any residual information from a prior entry.
  • Page 81: Locality Hints Specified By Each Instruction Class

    Figure 4-1. Memory Hierarchy (figure: register files, cache levels 1 through N each with a temporal and a non-temporal structure, and memory). The temporal structures cache memory accessed with temporal locality; the non-temporal structures cache memory accessed without temporal locality. Both structures assume that memory accesses possess spatial locality.
  • Page 82: Allocation Paths Supported In The Memory Hierarchy

    Each locality hint implies a particular allocation path in the memory hierarchy. The allocation paths corresponding to the locality hints are depicted in Figure 4-2. The allocation path specifies the structures in which the line containing the data being referenced would best be allocated. If the line is already at the same or higher level in the hierarchy no movement occurs.
  • Page 83: Memory Hierarchy Control Instructions And Hint Mechanisms

    The following instructions are defined for flush control: flush cache (fc, fc.i) and flush write buffers (fwb). The fc instruction invalidates the cache line in all levels of the memory hierarchy above memory. If the cache line is not consistent with memory, then it is copied into memory before invalidation.
  • Page 84: Memory Ordering Rules

    Refer to the description of sync.i on page 3:259 of Volume 3: Intel® Itanium® Instruction Set Reference for an example of self-modifying code. 4.4.7 Memory Access Ordering Memory data access ordering must satisfy read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR) data dependencies to the same memory location.
  • Page 85: Memory Ordering Instructions

    Table 4-21 summarizes memory ordering instructions related to cacheable memory. For definitions of the ordering rules related to non-cacheable memory, cache synchronization, and privileged instructions, refer to Section 4.4.7, “Sequentiality Attribute and Ordering” on page 2:82. Table 4-21. Memory Ordering Instructions Mnemonic Operation Ordered load and ordered check load...
  • Page 86: State Relating To Branching

    Table 4-22. Branch Types (Continued)
    Mnemonic | Function | Branch Condition | Target Address
    br.ia | Invoke the IA-32 instruction set | Unconditional | Indirect
    br.cloop | Counted loop branch | Loop count | IP-rel
    br.ctop, br.cexit | Modulo-scheduled counted loop | Loop count and Epilog count | IP-rel
    br.wtop, br.wexit | Modulo-scheduled while loop | Qualifying predicate | IP-rel...
  • Page 87: Instructions That Modify Rrbs

    iteration is started, and another is finished each time around. During the epilog phase, no new iterations are started, but previous iterations are completed (draining the software pipeline). A predicate is assigned to each stage to control the activation of the instructions in that stage (this predicate is called the “stage predicate”).
  • Page 88 There are two categories of software-pipelined loop branch types: counted and while. Both categories have two forms: top and exit. The “top” variant is used when the loop decision is located at the bottom of the loop body. A taken branch will continue the loop while a not-taken branch will exit the loop.
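For the counted "top" form, the interaction of LC and EC can be traced with a small model. The decision rule below is an assumption based on the usual description of br.ctop and is not stated in this excerpt, as is the convention of initializing LC to the iteration count minus one and EC to the number of pipeline stages. Register rotation is not modeled, and the function names are invented.

```python
def br_ctop(lc: int, ec: int):
    """Assumed br.ctop decision: return (taken, new_lc, new_ec, stage_predicate)."""
    if lc != 0:
        return True, lc - 1, ec, 1    # prolog/kernel: start another iteration
    if ec > 1:
        return True, lc, ec - 1, 0    # epilog: drain, no new iteration
    if ec == 1:
        return False, lc, 0, 0        # final epilog stage: fall out of the loop
    return False, lc, ec, 0           # ec == 0: loop already finished


def kernel_passes(iterations: int, stages: int) -> int:
    """Count how many times the loop body runs for a pipelined counted loop."""
    lc, ec = iterations - 1, stages   # assumed initialization convention
    passes = 0
    while True:
        passes += 1
        taken, lc, ec, _ = br_ctop(lc, ec)
        if not taken:
            return passes


passes_for_small_loop = kernel_passes(iterations=3, stages=3)
```

Under these assumptions the body runs iterations + stages - 1 times: the taken branches continue the loop through the prolog, kernel and epilog, and the single not-taken branch exits it.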
  • Page 89: Whether Prediction Hint On Branches

    only during the epilog phase and is initialized to one more than the number of epilog stages. If the qualifying predicate is zero during the speculative stages of the prolog, EC will be decremented during this part of the prolog, and the initialization value for EC is increased accordingly.
  • Page 90: Predictor Deallocation Hint

    Table 4-28. Predictor Deallocation Hint Completer Operation Don’t deallocate none Deallocate branch information 4.5.3 Branch Predict Instructions Branch predict instructions are entire instructions whose only purpose is to provide early information about future branches. Branch predict instructions provide the following pieces of information: •...
  • Page 91: Parallel Arithmetic Instructions

    saturation form treats both sources as signed and clamps the result to the limits of a signed range. The unsigned saturation form treats one source as unsigned and clamps the result to the limits of an unsigned range. Two variants are defined that treat the second source as either signed (.uus) or unsigned (.uuu).
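Signed saturation is easiest to see on byte-sized lanes. The sketch below adds eight 1-byte elements packed in a 64-bit value and clamps each lane to the signed range; the function names are illustrative, and the lane numbering (lane 0 in the least-significant byte) is an assumption that does not affect the per-lane results.

```python
def to_signed8(byte: int) -> int:
    """Interpret an 8-bit value as a signed two's complement number."""
    return byte - 256 if byte & 0x80 else byte


def parallel_add_signed_saturate(a: int, b: int) -> int:
    """Add eight signed bytes lane by lane, clamping each sum to -128..127."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = to_signed8((a >> shift) & 0xFF)
        y = to_signed8((b >> shift) & 0xFF)
        clamped = max(-128, min(127, x + y))
        result |= (clamped & 0xFF) << shift
    return result


overflow_case = parallel_add_signed_saturate(0x7F, 0x01)    # 127 + 1 clamps to 127
underflow_case = parallel_add_signed_saturate(0x80, 0xFF)   # -128 + -1 clamps to -128
plain_case = parallel_add_signed_saturate(0x0102, 0x0304)   # lanes add independently
```

A modulo (wrapping) add would turn 127 + 1 into -128; the saturating form holds the result at the limit, which is the behavior audio and image code generally wants.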
  • Page 92: Parallel Shift Instructions

    Table 4-29. Parallel Arithmetic Instructions (Continued) Mnemonic Operation 1-byte 2-byte 4-byte Parallel shift left and add with signed saturation pshladd Parallel shift right and add with signed saturation pshradd Parallel compare pcmp Parallel signed multiply of odd elements pmpy.l Parallel signed multiply of even elements pmpy.r Parallel signed multiply and shift right pmpyshr...
  • Page 93: Parallel Data Arrangement Instructions

    Table 4-31. Parallel Data Arrangement Instructions Mnemonic Operation 1-byte 2-byte 4-byte Interleave odd elements from both sources mix.l Interleave even elements from both sources mix.r Arbitrary copy of individual source elements Convert from larger to smaller elements with signed saturation pack.sss Convert from larger to smaller elements with unsigned pack.uss...
  • Page 94 Instructions are provided to transfer between the branch registers and the general registers. The move to branch register instruction can also optionally include branch hints. See “Branch Prediction Hints” on page 1:78. Instructions are defined to transfer between the predicate register file and a general register.
  • Page 95: String Support Instructions

    Table 4-33. String Support Instructions
    Mnemonic   Operation
    czx.l      Locate first zero element, left to right
    czx.r      Locate first zero element, right to left
    4.8.2 Bit Strings
    The population count instruction (popcnt) writes the number of bits that have a value of 1 in the source register into the target register.
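The population count operation can be modeled in a line of Python. This is only a sketch of the result value; the function name `popcnt` mirrors the mnemonic, and the 64-bit mask reflects the general register width.

```python
def popcnt(source):
    """Number of 1 bits in the low 64 bits of source."""
    return bin(source & 0xFFFFFFFFFFFFFFFF).count("1")
```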
  • Page 96: Floating-Point Register Format

    Floating-point Programming Model The floating-point architecture is fully compliant with the ANSI/IEEE Standard for Binary Floating-Point Arithmetic (Std. 754-1985). There is full IEEE support for single, double, and double-extended real formats. The two IEEE methods for controlling rounding precision are supported. The first method converts results to the double-extended exponent range.
  • Page 97: Floating-Point Register Encodings

    Real numbers reside in 82-bit floating-point registers in a three-field binary format (see Figure 5-1). The three fields are: • The 64-bit significand field, b63.b62b61 .. b1b0, contains the number's significant digits. This field is composed of an explicit integer bit (significand{63}), and 63 bits of fraction (significand{62:0}).
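To make the three-field format concrete, the sketch below computes the exact value of a register-format number from its sign, 17-bit biased exponent, and 64-bit significand. The function name `fr_value` is hypothetical. The exponent bias of 65535 (0xFFFF) is an assumption taken from the register format definition rather than from this excerpt, and the special encodings (exponent 0 and 0x1FFFF) are deliberately not modeled.

```python
from fractions import Fraction

EXP_BIAS = 0xFFFF  # assumed register-format exponent bias (65535)


def fr_value(sign, exponent, significand):
    """Exact value of a register-format number with a non-special exponent."""
    if sign not in (0, 1):
        raise ValueError("sign is a single bit")
    if not 0x00001 <= exponent <= 0x1FFFE:
        raise ValueError("exponent 0 and 0x1FFFF are special encodings, not modeled")
    if not 0 <= significand < (1 << 64):
        raise ValueError("significand is 64 bits")
    # Bit 63 is the explicit integer bit and bits 62:0 are the fraction,
    # so the significand reads as fixed point with 63 fraction bits.
    magnitude = Fraction(significand, 1 << 63) * Fraction(2) ** (exponent - EXP_BIAS)
    return -magnitude if sign else magnitude
```

With these assumptions, an exponent of 0x1003E scales the significand by exactly 2^63, so the significand reads as a plain integer.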
  • Page 98 Table 5-2. Floating-point Register Encodings (Continued) Biased Significand Sign Class or Subclass Exponent i.bb...bb (1 bit) (17-bits) (64-bits) (Explicit Integer Bit is Shown) Pseudo-NaNs 0x1FFFF 0.000...01 through 0.111...11 Pseudo-Infinity 0x1FFFF 0.000...00 Normalized Numbers 0x00001 1.000...00 through 1.111...11 (Floating-point Register Format Normals) through 0x1FFFE Integers or Parallel FP...
  • Page 99: Floating-Point Status Register Format

    Table 5-2. Floating-point Register Encodings (Continued) Biased Significand Sign Class or Subclass Exponent i.bb...bb (1 bit) (17-bits) (64-bits) (Explicit Integer Bit is Shown) IA-32 Stack Double Real Denormals 0x00000 0.000...01...(11)0s (produced when computation model is through IA-32 Stack Double) 0.111...11...(11)0s Double-Extended Real Pseudo-Denormals 0x00000 1.000...00 through 1.111...11...
  • Page 100: Floating-Point Status Field Format

    Table 5-3. Floating-point Status Register Field Description
    Field      Description
    traps.vd   Invalid Operation Floating-Point Exception fault (IEEE Trap) disabled when this bit is set
    traps.dd   Denormal/Unnormal Operand Floating-Point Exception fault disabled when this bit is set
    traps.zd   Zero Divide Floating-Point Exception fault (IEEE Trap) disabled when this bit is set
    traps.od   Overflow Floating-Point Exception trap (IEEE Trap) disabled when this bit is set
    traps.ud...
  • Page 101: Floating-Point Rounding Control Definitions

    fields flags are merely indications of the occurrence of floating-point exceptions. Flush-to-Zero (FTZ) mode causes results which encounter “tininess” (see “Definition of Tininess, Inexact and Underflow” on page 1:106) to be truncated to the correctly signed zero. Flush-to-Zero mode can be enabled only if Underflow is disabled. If Underflow is enabled then it takes priority and Flush-to-Zero mode is ignored.
  • Page 102: Floating-Point Memory Access Instructions

    If FPSR.sfx.td is set, the FPSR.traps bits are treated as if they are all set (disabled). Note that FPSR.sf0.td is a reserved field which returns 0 when read. Floating-point Instructions This section describes the floating-point instructions. Refer to Volume 3: Intel® Itanium® Instruction Set Reference for a detailed description. 5.3.1...
  • Page 103: Memory To Floating-Point Register Data Translation - Single Precision

    Figure 5-4. Memory to Floating-point Register Data Translation – Single Precision. Panels (each showing sign, exponent, integer bit, and significand fields): single-precision load/setf.s of normal numbers; of infinities and NaNs (register exponent 0x1FFFF); ...
  • Page 104: Memory To Floating-Point Register Data Translation - Double Precision

    Figure 5-5. Memory to Floating-point Register Data Translation – Double Precision. Panels (each showing sign, exponent, integer bit, and significand fields): double-precision load/setf.d of normal numbers; of infinities and NaNs (register exponent 0x1FFFF); ...
  • Page 105: Memory To Floating-Point Register Data Translation - Double Extended, Integer, Parallel Fp And

    Figure 5-6. Memory to Floating-point Register Data Translation – Double Extended, Integer, Parallel FP and Fill. Panels (each showing sign, exponent, integer bit, and significand fields): double-extended-precision load of normal/unnormal numbers; of infinities and NaNs (register exponent 0x1FFFF); ...
  • Page 106: Floating-Point Register To Memory Data Translation - Single Precision

    Figure 5-7. Floating-point Register to Memory Data Translation – Single Precision (panel: Memory/GR single-precision store/getf.s). Figure 5-8. Floating-point Register to Memory Data Translation – Double Precision (panel: Memory/GR double-precision store/getf.d). Volume 1, Part 1: Floating-point Programming Model 1:95...
  • Page 107: Floating-Point Register To Memory Data Translation - Double Extended, Integer, Parallel Fp And

    Figure 5-9. Floating-point Register to Memory Data Translation – Double Extended, Integer, Parallel FP and Spill. Panels: Memory/GR integer/parallel FP store/getf.sig; memory double-extended-precision store; memory register spill. Both little-endian and big-endian byte orderings are supported on floating-point loads and stores.
  • Page 108: Floating-Point Register Transfer Instructions

    Figure 5-10. Spill/Fill and Double-extended (80-bit) Floating-point Memory Formats. Panels: floating-point register format (82-bit); spill/fill memory format (128-bit); double-extended memory format (80-bit) and its interpretation...
  • Page 109: General Register (Integer) To Floating-Point Register Data Translation (Setf)

    Table 5-9. General Register (Integer) to Floating-point Register Data Translation (setf) General Floating-Point Register (.sig) Floating-Point Register (.exp) Register Class Integer Sign Exponent Significand Sign Exponent Significand ignore NaTVal NaTVal integers 000...00 0x1003E integer integer{17} integer{16:0} 0x8000000000000000 through 111...11 Table 5-10. Floating-point Register to General Register (Integer) Data Translation (getf) Floating-Point Register General Register (.sig) General Register (.exp)
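The setf rows of Table 5-9 can be read as two small field mappings. The sketch below models a floating-point register as a (sign, exponent, significand) tuple; the function names `setf_sig` and `setf_exp` and the tuple representation are illustrative assumptions, and the NaTVal row is not modeled.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def setf_sig(gr):
    """setf.sig: the integer becomes the significand, with sign 0 and exponent 0x1003E."""
    return (0, 0x1003E, gr & MASK64)


def setf_exp(gr):
    """setf.exp: bit 17 becomes the sign, bits 16:0 the exponent,
    and the significand is fixed at 0x8000000000000000."""
    return ((gr >> 17) & 1, gr & 0x1FFFF, 0x8000000000000000)
```

For example, `setf_sig(5)` gives `(0, 0x1003E, 5)`, the register-format encoding in which the significand is read as the integer 5.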
  • Page 110: Arithmetic Floating-Point Pseudo-Operations

    Table 5-12. Arithmetic Floating-point Instructions (Continued) Floating-point minimum fmin.sf fpmin.sf Floating-point maximum fmax.sf fpmax.sf Floating-point absolute minimum famin.sf fpamin.sf Floating-point absolute maximum famax.sf fpamax.sf Convert floating-point to signed integer fcvt.fx.sf fpcvt.fx.sf fcvt.fx.trunc.sf fpcvt.fx.trunc.sf Convert floating-point to unsigned integer fcvt.fxu.sf fpcvt.fxu.sf fcvt.fxu.trunc.sf fpcvt.fxu.trunc.sf Convert signed integer to floating-point...
  • Page 111: Non-Arithmetic Floating-Point Instructions

    The fneg pseudo-operation (see Table 5-15) simply reverses the sign bit of the operand and is therefore not equivalent to the IEEE negation operation. For the IEEE negation operation, an fnma using FR 1 as the multiplicand and FR 0 as the addend must be used.
  • Page 112: Fpsr Status Field Instructions

    with the FPSR.sf0.flags and FPSR.traps. If the flags of the alternate status field indicate the occurrence of an event that corresponds to an enabled floating-point exception in FPSR.traps, or an event that is not already registered in the FPSR.sf0.flags (i.e., the flag for that event in FPSR.sf0.flags is clear), then the fchkf instruction branches to recovery code.
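The fchkf decision described above reduces to a bitmask test. The sketch below is a simplified model under stated assumptions: the name `fchkf_needs_recovery` is hypothetical, and the six exception events are assumed to occupy the same bit positions in the alternate flags, FPSR.sf0.flags, and FPSR.traps. A set bit in the traps mask means the exception is disabled.

```python
def fchkf_needs_recovery(alt_flags, sf0_flags, traps_disabled):
    """True if fchkf would branch to recovery code.

    All three arguments are 6-bit masks with one bit per exception event,
    assumed to share the same bit order.
    """
    mask = 0x3F
    enabled = ~traps_disabled & mask   # exceptions that are enabled
    unregistered = ~sf0_flags & mask   # events not yet recorded in sf0.flags
    return (alt_flags & (enabled | unregistered)) != 0
```

Recovery is skipped only when every event flagged in the alternate status field is both disabled and already recorded in the main status field.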
  • Page 113 Exceptions are processed according to a predetermined precedence. Precedence in exception handling means that higher-priority exceptions are flagged first and results are delivered according to the requirements of that exception. Lower-priority exceptions are not flagged even if they occur. For example, dividing an SNaN by zero causes an invalid operation exception (due to the SNaN) and not a zero-divide exception;...
  • Page 114: Floating-Point Exception Fault Prioritization

    Figure 5-11. Floating-point Exception Fault Prioritization (flowchart; decision points include NaTVal operand, unsupported operand, SNaN operand, QNaN operand, and other invalid conditions)...
  • Page 115 5.4.1.3 Floating-point Exception Trap A Floating-point Exception trap occurs if one of the following four circumstances arises: 1. The processor requests system software assistance to complete the operation, via the Software Assist trap 2. The IEEE Overflow trap is enabled and an overflow occurs 3.
  • Page 116: Definition Of Overflow

    Figure 5-12. Floating-point Exception Trap Prioritization (flowchart). Intermediate values named in the figure: tmp_exp = result exponent, tmp_sig = result significand, tmp_i = inexactness indicator, tmp_fpa = significand roundup...
  • Page 117: Definition Of Tininess, Inexact And Underflow

    then inexactness is signaled. If the significand was rounded by adding a one to its least significant bit, then bit fpa in ISR.code is set to 1. Finally, an interruption due to a Floating-Point Exception trap will occur. Note that when rounding to single, double, or double-extended real, the overflow trap enabled response for normal (non Parallel FP) arithmetic instructions is not guaranteed to be in the range of a valid single, double, or double-extended real quantity, because it is in 17-bit exponent format.
  • Page 118: Integer Invalid Operations

    performance on implementations that do not implement denormal handling in hardware. When the Flush-to-Zero mode is enabled, floating-point exception software assist traps will not occur when producing tiny results. 5.4.4 Integer Invalid Operations Floating-point to integer conversions which are invalid (in the IEEE sense) signal an Invalid Operation Floating-Point Exception fault.
  • Page 119 • The NaTVal is a natural extension of the IEEE concept of NaNs. It is used to support speculative execution. • Flush-to-Zero mode is an industry standard addition. • The minimum and maximum instructions allow the efficient execution of the common Fortran Intrinsic Functions: MIN(), MAX(), AMIN(), AMAX();...
  • Page 120: Ia-32 Execution Layer

    This section does not cover the details of the IA-32 application programming model, IA-32 instructions and registers. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for details regarding the IA-32 application programming model. Volume 1, Part 1: IA-32 Application Execution Model in an Intel® Itanium® System Environment 1:109...
  • Page 121: Instruction Set Modes

    • Itanium instructions can access the entire Itanium and IA-32 application register state. This includes IA-32 segment descriptors, selectors, general registers, physical floating-point registers, MMX technology registers, and SSE registers. See...
  • Page 122 Itanium instruction set. There are two forms: register indirect and absolute. The absolute form computes the Itanium target virtual address as follows:...
  • Page 123 Itanium instruction set into IA-32 VM86, Real Mode or Protected Mode. While jmpe and interruptions will transition the processor from either IA-32 VM86, Real Mode or...
  • Page 124: Instruction Set Mode Transitions

    To promote straightforward parameter passing, integer and IEEE floating-point register and memory data types are binary compatible between the IA-32 and Itanium instruction sets.
  • Page 125: Ia-32 Application Register Model

    • Undefined: Registers marked as undefined may be used as scratch areas for execution of IA-32 instructions by the processor and are not ensured to be preserved across instruction set transitions.
  • Page 126: Ia-32 Application Register Mapping

    Instruction Pointer Floating-point Registers constant +0.0 constant +1.0 FR2-5 unmodified Intel® Itanium® preserved registers FR6-7 undefined IA-32 code execution space...
  • Page 127 IA-32 time stamp counter (TSC) and Intel® Itanium® Interval Timer unmodified RUC continues to count while in IA-32 execution mode...
  • Page 128: Ia-32 General Registers (Gr8 To Gr15)

    IP is a 64-bit virtual pointer shared with the Itanium instruction set. The following relationship is defined between EIP and IP while executing IA-32 instructions. IP{63:32} = 0; IP{31:0} = EIP{31:0} + CSD.Base;
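The EIP-to-IP relationship can be written as a small function. This is a sketch under one assumption: since IP{63:32} is zero and IP{31:0} holds the sum, the addition is taken to wrap modulo 2^32. The name `ia32_ip` is hypothetical.

```python
def ia32_ip(eip, csd_base):
    """64-bit IP for IA-32 execution: upper 32 bits zero, lower 32 bits EIP + CSD.Base."""
    return (eip + csd_base) & 0xFFFFFFFF
```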
  • Page 129: Ia-32 Segment Register Selector Format

    type 55:52 Type identifier for data/code segments, including the Access bit (bit 52). See the Intel® 64 and IA-32 Architectures Software Developer’s Manual for encodings and definition. Non System Segment. If 1, a data segment, if 0 a system segment.
  • Page 130 32-bits, otherwise 16-bits. Segment Limit Granularity. If 1, scales the segment limit by lim=(lim<<12) | 0xFFF for IA-32 instruction set memory references. This field is ignored for Intel® Itanium® instruction set memory references. 6.2.2.3.1 Data and Code Segments...
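The granularity scaling lim=(lim<<12) | 0xFFF turns a limit expressed in 4 KB pages into a byte limit. The sketch below shows the arithmetic; the function name `effective_limit` is hypothetical, and the 20-bit example limit follows the usual IA-32 descriptor layout rather than anything stated in this excerpt.

```python
def effective_limit(limit, g_bit):
    """Segment limit in bytes as seen by IA-32 memory references."""
    if g_bit:
        # Scale by 4 KB and include the final page's low 12 bits.
        return (limit << 12) | 0xFFF
    return limit
```

With the granularity bit set, a limit of 0xFFFFF covers the full 4 GB range (0xFFFFFFFF).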
  • Page 131: Ia-32 Environment Initial Register State

    Segment limit should be set to 0xFFFF for normal RM 64KB operation. f. For valid segments the p-bit should be set to 1, for null segments the p-bit should be set to 0.
  • Page 132 • Itanium architecture-based software should ensure PSR.cpl is 0 • Itanium architecture-based software should ensure the stack segment descriptor register’s DPL is 0.
  • Page 133: Ia-32 Environment Runtime Integrity Checks

    Stack Fault references to SS read and not readable, write and not writeable s, p, a-bits are not 1 g-bit/limit segment limit violation...
  • Page 134: Eflag Register (Ar24)

    These flags are ignored by Itanium instructions. Flags ID, OF, DF, SF, ZF, AF, PF and CF are defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
  • Page 135: Ia-32 Eflags Register Fields

    IA-32 floating-point register stack, numeric controls and environment are mapped into the Itanium floating-point registers FR8 - FR15 and the application register name space as shown in Table 6-6.
  • Page 136: Ia-32 Floating-Point Register Mappings

    IA-32 Floating-point Stack IA-32 floating-point registers are defined as follows: • IA-32 numeric register stack is mapped to FR8 - FR15, using the Intel 8087 80-bit IEEE floating-point format. • For IA-32 instruction set references, floating-point registers are logically mapped into FR8 - FR15 based on the IA-32 top-of-stack (TOS) pointer held in FCR.top.
  • Page 137 NaN, Infinity or Denormal of each IA-32 logical floating-point register are not supported. However, IA-32 instruction set reads of FTW compute the additional special...
  • Page 138: Ia-32 Floating-Point Control Register (Fcr)

    Intel Itanium Usage in the Intel IA-32 State Bits IA-32 Usage State Itanium Architecture FSW, FTW, MXCSR state in the FSR Register...
  • Page 139 6.2.2.5.4 IA-32 Floating-point Environment To support the Intel 8087 delayed numeric exception model, FSR, FDR and FIR contain pending information related to the numeric exception. FDR contains the operand’s effective address and segment selector. FIR contains the numeric instruction’s effective address, code segment selector, and opcode bits.
  • Page 140: Floating-Point Data Register (Fdr)

    IA-32 Intel MMX Technology Registers The eight IA-32 Intel MMX technology registers are mapped on the eight Itanium floating-point registers FR8 - FR15 where MM0 is mapped to FR8 and MM7 is mapped to FR15. The MMX technology register mapping for the IA-32 floating-point stack view is dependent on the floating-point IA-32 Top-of-Stack value.
  • Page 141: Sse Registers (Xmm0-Xmm7)

    To avoid performance degradation, software programmers are strongly recommended not to intermix IA-32 floating-point and IA-32 MMX technology instructions. See the Intel® 64 and IA-32 Architectures Software Developer’s Manual for details on MMX technology coding guidelines. 6.2.2.7 IA-32 SSE Registers The eight 128-bit IA-32 SSE registers (XMM0-7) are mapped on sixteen physical Itanium floating-point register pairs FR16 - FR31.
  • Page 142: Memory Addressing Model

    Starting 32-bit virtual addresses are truncated to 32-bits after the addition of the segment base. Ending virtual address...
  • Page 143 • All IA-32 stores have release semantics • All IA-32 loads have acquire semantics • All IA-32 read-modify-write or lock instructions have release and acquire semantics (fully fenced).
  • Page 144 IA-32 code, existing entries in the ALAT are ignored. For details on the ALAT, refer to Section 4.4.5.2, “Data Speculation and Instructions” on page 1:64.
  • Page 145 Software should not rely on the behavior of NaT or NaTVal during IA-32 instruction execution, or propagate NaT or NaTVal into IA-32 instructions.
  • Page 146: Part Ii: Optimization Guide For The Intel

    Part II: Optimization Guide for the Intel® Itanium® Architecture
  • Page 148: Overview Of The Optimization Guide

    Itanium instruction set. It is intended for those interested in furthering their understanding of application architecture features and optimization techniques that benefit application performance. Intel and the industry are developing compilers to take advantage of these techniques. Application developers are not advised to use this as a guide to assembly language programming for the Itanium architecture.
  • Page 149 1:138 Volume 1, Part 2: About the Optimization Guide...
  • Page 150: Introduction To Programming For The Intel ® Itanium ® Architecture

    (f0-f127) that are used for floating-point computations. The first two registers, f0 and f1, are read-only and read as +0.0 and +1.0, respectively. Instructions that write to f0 or f1 will fault. Volume 1, Part 2: Introduction to Programming for the Intel® Itanium® Architecture 1:139...
  • Page 151: Using Intel ® Itanium ® Instructions

    (RAW) or write after write (WAW) register dependencies. Instruction groups are delimited by stops in the assembly source code. Since instruction groups have no RAW...
  • Page 152: Bundles And Templates

  • Page 153: Memory Access And Speculation

    // Earlier cycle
    // Other instructions
    (p1) br.cond.dptk L1;;   // Cycle 0
    chk.s r3,recovery        // Cycle 1
    shr r7=r3,r87            // Cycle 1
  • Page 154: Predication

    When the value is false (0), the processor discards any results and raises no exceptions. Consider the following C code: if (a) { b = c + d; if (e) { h = i + j;...
  • Page 155: Architectural Support For Procedure Calls

    Branches and Hints Since branches have a major impact on program performance, the Itanium architecture includes features to improve their performance by:...
  • Page 156: Branch Instructions

    Thus, after one rotation, the content of register X will be found in register X+1 and the value of the highest numbered rotating register...
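The effect of one rotation can be pictured with a list standing in for the rotating portion of a register file. This is a simplified model of the visible effect only (each value appears one register higher and the highest-numbered value wraps to the lowest rotating register); the function name `rotate_once` and the list representation are illustrative assumptions, and the hardware achieves this by renaming rather than by copying.

```python
def rotate_once(rotating):
    """Return the rotating registers as seen after one rotation.

    rotating[0] models the lowest-numbered rotating register,
    rotating[-1] the highest-numbered one.
    """
    if not rotating:
        return []
    # The highest-numbered value wraps around; every other value moves up by one.
    return [rotating[-1]] + rotating[:-1]
```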
  • Page 157: Summary

    • Reduced overhead for procedure calls through the register stack mechanism.
    • Streamlined loop handling through hardware support of software pipelined loops.
    • Support for hiding memory latency using speculation.
  • Page 158: Overview

    Memory Reference Overview Memory latency is a major factor in determining the performance of integer applications. In order to help reduce the effects of memory latency, the Itanium architecture explicitly supports software pipelining, large register files, and compiler-controlled speculation. This chapter discusses features and optimizations related to compiler-controlled speculation.
  • Page 159: Data Prefetch Hint

    3.2.3 Data Prefetch Hint The lfetch instruction requests that lines be moved between different levels of the memory hierarchy. Like all hint instructions defined in the Itanium architecture, lfetch has no effect on program correctness, and any microarchitecture implementation may choose to ignore it.
  • Page 160: Data Dependencies

    A compiler cannot safely move the load instruction before the branch unless it can guarantee that the moved load will not cause a fatal program fault or otherwise corrupt program state. Since the load cannot be moved upward, the schedule cannot be improved using normal code motion.
  • Page 161: Itanium ® Architecture

    3.3.2.2 Data Dependency in the Intel® Itanium® Architecture The Itanium architecture requires the programmer to insert stops between RAW and WAW register dependencies to ensure correct code results. For example, in the code below, the add instruction computes a value in r4 needed by the sub instruction: add r4=r5,r6 ;;...
  • Page 162 *ptr1 = 6; x = *ptr2; Using Speculation in the Intel® Itanium® Architecture to Overcome Dependencies Both data and control dependencies constrain optimization of program code. The Itanium architecture provides support for two basic techniques used to overcome dependencies: •...
  • Page 163: Using Data Speculation In The Intel Architecture

    3.4.2 Using Data Speculation in the Intel® Itanium® Architecture Data speculation in the Itanium architecture uses a special load instruction (ld.a) called an advanced load instruction and an associated check instruction (chk.a or ld.c) to validate data-speculated results.
  • Page 164 If no matching entry is found, the speculative results need to be recomputed: • Use a chk.a if a load and some of its uses are speculated. The chk.a jumps to compiler-generated recovery code to re-execute the load and dependent instructions.
  • Page 165 The compiler could move up not only the load, but also one or more of its uses. This transformation uses a chk.a rather than a ld.c instruction to validate the advanced load. Using the same example code sequence but now advancing the add as well as the ld8 results in: ld8.a r6=[r8];;...
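The advanced-load and check protocol can be illustrated with a toy model of the ALAT. This is a deliberately simplified sketch, not the hardware structure: the class name `ToyALAT`, the dictionary keyed by target register, and the exact address-overlap test are all assumptions made for illustration.

```python
class ToyALAT:
    """Toy model: one entry per advanced-load target register."""

    def __init__(self):
        self.entries = {}  # target register -> (address, size in bytes)

    def advanced_load(self, target_reg, addr, size):
        # ld.a: record the address the speculative load read from.
        self.entries[target_reg] = (addr, size)

    def store(self, addr, size):
        # A store removes every entry whose address range it overlaps.
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def check(self, target_reg):
        # ld.c / chk.a: True if the entry survived, so the speculation held.
        return target_reg in self.entries
```

When `check` returns False, the model corresponds to the case where ld.c reloads the value or chk.a branches to recovery code that re-executes the load and its dependent instructions.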
  • Page 166 3.4.3 Using Control Speculation in the Intel® Itanium® Architecture The check to determine if control speculation was successful is similar to that for data speculation. 3.4.3.1 The NaT Bit The Not A Thing (NaT) bit is an extra bit on each of the general registers. A register NaT bit indicates whether the content of a register is valid.
  • Page 167: Combining Data And Control Speculation

    Although every speculative computation needs to be checked, this does not mean that every speculative load requires its own chk.s. Speculative checks can be optimized by taking advantage of the propagation of NaT bits through registers as described in Section 3.5.6.
  • Page 168 Optimization of Memory References Speculation can increase parallelism and help to hide latency by enabling more code motion than can be performed on traditional architectures. Speculation can increase the application of traditional loop optimizations such as invariant code motion and common subexpression elimination.
  • Page 169 3.5.2 Data Interference Data references with low interference probabilities and high path probabilities can make the best use of data speculation. In the pseudo-code below, assume the probabilities that the stores to *p1 and *p2 conflict with var are independent. *p1 = /* Prob interference = 0.30 */ .
  • Page 170: Minimizing Code Size During Speculation

    memory conflicts, or aliasing in the ALAT, the decision as to where to place recovery code for advanced loads is more difficult than for control speculation and should be based on the expected conflict rate for each load. As a general rule, efficient compilers will attempt to minimize code growth related to speculation.
  • Page 171 A disadvantage of post-increment loads is that they create new dependencies between post-increment loads and the operations that use the post-increment values. In some cases, the compiler may wish to separate post-increment loads into their component instructions to improve the overall schedule. Alternatively, the compiler could wait until after instruction scheduling and then opportunistically find places where post-increment loads could be substituted for separate load and add instructions.
  • Page 172: Using A Single Check For Three Advanced Loads

    3.5.6 Minimizing Check Code Checks of speculative loads can sometimes be combined to reduce code size. The propagation of NaT bits and NaTVals via speculative instructions can permit a single check of a speculative result to replace multiple intermediate checks. The code below demonstrates this optimization potential: ld4.s r1=[r10]...
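The idea of replacing several checks with one can be modeled by letting a NaT flag travel with each value. The sketch below is an illustrative model only: the names `Reg`, `ld_s`, `add`, and `chk_s` are hypothetical stand-ins for the corresponding instructions, and a missing dictionary key stands in for a load that would have faulted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False  # True when the register holds a deferred exception


def ld_s(memory, addr):
    """Speculative load: defer a fault by setting NaT instead of raising."""
    if addr in memory:
        return Reg(memory[addr])
    return Reg(0, True)


def add(a, b):
    """Arithmetic propagates NaT from either source to the result."""
    return Reg(a.value + b.value, a.nat or b.nat)


def chk_s(reg):
    """True if recovery code must run for this speculative result."""
    return reg.nat
```

Because the flag propagates through `add`, checking only the final sum is enough to detect a deferred fault in any of the loads that fed it.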
  • Page 173 Summary The examples in this chapter show where the Itanium architecture can take advantage of existing techniques like dynamic profiling and disambiguation. Special architectural support allows implementation of speculation in common scenarios in which it would normally not be allowed. Speculation, in turn, increases ILP by making greater code motion possible, thus enhancing traditional optimizations such as those involving loops.
  • Page 174 Predication, Control Flow, and Instruction Stream Overview This chapter is divided into three sections that describe optimizations related to predication, control flow, and branch hints as follows: • The predication section describes if-conversion, predicate usage, and code scheduling to reduce the effects of branching. •...
  • Page 175 4.2.2 Predication in the Intel® Itanium® Architecture Now that the performance implications of branching have been described, this section gives an overview of predication in the Itanium architecture, the primary mechanism used by optimizations described in this section.
  • Page 176 Almost all Itanium instructions can be tagged with a guarding predicate. If the value of the guarding predicate is false at execution time, then the predicated instruction’s architectural updates are suppressed, and the instruction behaves like a nop. If the predicate is true, then the instruction behaves as if it were unpredicated.
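The guarding-predicate rule can be stated as a tiny interpreter step. This is a sketch of the semantics only; the function name `execute` and the dictionary-based register file are assumptions for illustration.

```python
def execute(regs, predicate, dest, operation, *sources):
    """Apply one predicated instruction to a register file.

    If the predicate is false, architectural state is left untouched
    (the instruction behaves like a nop). Otherwise the destination is
    updated exactly as an unpredicated instruction would update it.
    """
    if not predicate:
        return regs
    updated = dict(regs)
    updated[dest] = operation(*(regs[s] for s in sources))
    return updated
```

For example, `execute(regs, p1, "b", lambda x, y: x + y, "c", "d")` models an add to b guarded by p1, which is how the body of `if (a) { b = c + d; }` can run without a branch.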
  • Page 177 The process of predicating instructions in conditional blocks and removing branches is referred to as if-conversion. Once if-conversion has been performed, instructions can be scheduled more freely because there are fewer branches to limit code motion, and there are fewer branches competing for issue slots. In addition to removing branches, this transformation will make dynamic instruction fetching more efficient since there are fewer possibilities for control flow changes.
  • Page 178: Flow Graph Illustrating Opportunities For Off-Path Predication

    Figure 4-1. Flow Graph Illustrating Opportunities for Off-path Predication Block B Block A If some of the instructions in block A or block B can be included in the main trace without increasing its critical path, then techniques of upward code motion can be applied to reduce the critical path through blocks A and B when they are taken.
  • Page 179 4.2.3.4 Downward Code Motion As with upward code motion, downward code motion is normally difficult in the presence of stores. The next example shows how code can be moved downward past a label, a transformation that is often unsafe without predication: r56 = [r45];;...
  • Page 180 4.2.4.1 Unbalanced Execution Paths The simple conditional below has an unbalanced flow-dependency height. Suppose that non-predicated assembly for this sequence takes two clocks for the if-block and approximately 18 clocks for the else-block if we assume a setf takes 8 clocks, a getf takes 2 clocks, and an xma takes 6 clocks: if (r4) // 2 clocks...
  • Page 181 4.2.4.4 Case 3 Suppose the if-clause is executed 30% of the time and the branch mispredicts 30% of the time. The average number of clocks for: • Unpredicated code is: (2 cycles * 30%) + (18 cycles * 70%) + (10 cycles * 30%) = 16.2 clocks •...
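The unpredicated-code estimate above is a weighted average plus a misprediction term, which can be checked directly. The function name `expected_clocks` and its parameter names are hypothetical; the 10-cycle penalty is the figure used in the formula above.

```python
def expected_clocks(if_clocks, else_clocks, p_if, mispredict_rate, penalty):
    """Average clocks for a branchy if/else given path and misprediction rates."""
    path_cost = if_clocks * p_if + else_clocks * (1 - p_if)
    return path_cost + penalty * mispredict_rate
```

Calling `expected_clocks(2, 18, 0.30, 0.30, 10)` reproduces the 0.6 + 12.6 + 3.0 = 16.2 clock figure for Case 3.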
  • Page 182 4.2.5 Guidelines for Removing Branches The following if-conversion guidelines apply to cases where only local behavior of the code and its execution profile are known: 1. The flow dependency and resource availability heights of both paths must be considered when deciding whether to predicate or not. 2.
  • Page 183 4.3.1 Reducing Critical Path with Parallel Compares The computation of the compound branch condition shown below requires several instructions on processors without special instructions: if ( rA || rB || rC || rD ) { /* If-block instructions */ /* after if-block */ The pseudo-code below shows one possible solution that uses a sequence of branches: cmp.ne p1,p0 = rA,0 cmp.ne p2,p0 = rB,0...
  • Page 184 Initialization code must be placed in an instruction group prior to the parallel compare. However, since the initialization code has no dependencies on prior values, it can generally be scheduled without contributing to the critical path of the code. The instructions below show how to generate code for the example above using parallel compares: cmp.ne p1,p0 = r0,r0;;...
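The reason the compares can run in parallel is that an or-type compare only ever writes its target when its own condition holds, so the order of the writes does not matter. The sketch below models that behavior under stated assumptions: `cmp_ne_or` and `compound_or` are hypothetical names, and the model captures only the write-if-true rule, not instruction encoding or timing.

```python
def cmp_ne_or(predicates, target, value):
    """or-type compare: set the target predicate only when value != 0."""
    if value != 0:
        predicates[target] = True


def compound_or(rA, rB, rC, rD):
    """Evaluate rA || rB || rC || rD with an initialized predicate."""
    predicates = {"p1": False}  # initialization, as in cmp.ne p1,p0 = r0,r0
    # The order is deliberately scrambled: every compare either writes True
    # or writes nothing, so all four could issue in the same instruction group.
    for value in (rD, rB, rA, rC):
        cmp_ne_or(predicates, "p1", value)
    return predicates["p1"]
```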
  • Page 185 An example uses a basic block with four possible successors. The following Itanium architecture-based multi-target branch code uses a BBB bundle template and can branch to either block B, block C, block D, or fall through to block A: label_AA: ...
  • Page 186 The Itanium architecture allows multiple instructions to target the same register in the same clock provided that only one of the instructions writing the target register is predicated true in that clock. Similar capabilities exist for writing predicate registers, as discussed in Section 4.3.1.
  • Page 187 By using predication to reduce the number of control flow changes, the fetching efficiency will generally improve. The only case where predication is likely to reduce instruction cache efficiency is when there is a large increase in the number of instructions fetched which are subsequently predicated off.
  • Page 188 Two types of branch-related hints are defined by the Itanium architecture: branch prediction hints and instruction prefetch hints. Branch prediction hints let the compiler recommend the resources (if any) that should be used to dynamically predict specific branches. With prefetch hints, the compiler can indicate the areas of the code that should be prefetched to reduce demand I-cache misses.
  • Page 189 This scenario can be hinted to the processor by executing an advanced load (ld.a or ld.sa) to the address that this software thread is waiting on, and then by executing a hint @pause instruction (in a subsequent instruction group). This encourages the processor to devote more resources to other threads, yet if an entry is invalidated from this thread's ALAT, normal processor resource allocation is resumed for this thread.
  • Page 190 Resource allocation within the processor eventually reverts to a fair allocation, so there's no need for software to hint that it is no longer in a critical section. Processors that support this hint also ensure that it cannot be abused to affect overall longer-term fairness of processor resource allocation.
  • Page 192: Software Pipelining And Loop Support

    Software Pipelining and Loop Support Overview The Itanium architecture provides extensive support for software-pipelined loops, including register rotation, special loop branches, and application registers. When combined with predication and support for speculation, these features help to reduce code expansion, path length, and branch mispredictions for loops that can be software pipelined.
  • Page 193 This section describes two general methods for overlapping loop iterations, both of which result in code expansion on traditional architectures. The code expansion problem is addressed by loop support features in the Itanium architecture that are explored later in this chapter. The loop above will be used as a running example in the next few sections.
  • Page 194 utilization can be increased by unrolling the loop more times, but at the cost of further code expansion. The loop below is unrolled four times (assuming the trip count is a multiple of four): r15 = 4,r5 r25 = 8,r5 r35 = 12,r5 r16 = 4,r6 r26 = 8,r6 r36 = 12,r6;;...
  • Page 195 Loop Support Features in the Intel® Itanium® Architecture The code expansion that results from loop optimizations (such as software pipelining and loop unrolling) on traditional architectures can increase the number of instruction cache misses, thus reducing overall performance.
  • Page 196 Itanium architecture allow some loops to be software pipelined without code expansion. Register rotation provides a renaming mechanism that reduces the need for loop unrolling and software renaming of registers. Special software pipelined loop branches support register rotation and, combined with predication, reduce the need to generate separate blocks of code for the prolog and epilog phases.
  • Page 197 for the same source iteration. Each one written to p16 sequentially enables all the stages for a new source iteration. This behavior is used to enable or disable the execution of the stages of the pipelined loop during the prolog, kernel, and epilog phases as described in the next section.
  • Page 198: Ctop And Cexit Execution Flow

    and a decision is made to exit the loop. The special case in which a software-pipelined loop branch is executed with EC equal to 0 can occur in unrolled software-pipelined loops if the target of the cexit branch is set to the next sequential bundle. Figure 5-1.
  • Page 199: Ctop Loop Trace

    Note: Rotating GRs are now used in the code (the code directly preceding did not use them). Also, induction variables that are post-incremented must be allocated to the static portion of the register file: lc = 199 // LC = loop count - 1 ec = 4 // EC = epilog stages + 1 pr.rot = 1<<16;;...
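The interaction of LC, EC, and the rotating stage predicates can be illustrated with a simplified Python model of the counted loop branch. This is a sketch under stated assumptions, not the architectural definition: only predicate rotation is modeled, and the function name and trace format are hypothetical.

```python
def run_ctop_loop(lc, ec, n_rotating=48):
    """Simplified model of a br.ctop loop.

    Returns the number of kernel executions and, for each one, the
    stage predicates (p16, p17, p18, p19)."""
    preds = [False] * n_rotating   # preds[0] models p16, preds[47] models p63
    preds[0] = True                # pr.rot = 1<<16: p16 set, p17-p63 cleared
    kernel_runs = 0
    trace = []
    while True:
        kernel_runs += 1
        trace.append(tuple(preds[:4]))
        if lc != 0:                # prolog/kernel: enable a new source iteration
            lc -= 1
            new_p16, taken = True, True
        elif ec > 1:               # epilog: drain the stages still in flight
            ec -= 1
            new_p16, taken = False, True
        elif ec == 1:              # final epilog stage: rotate, then fall through
            ec -= 1
            new_p16, taken = False, False
        else:                      # EC == 0 special case: no rotation, fall through
            return kernel_runs, trace
        # Writing p63 and rotating makes the new value visible as p16
        preds = [new_p16] + preds[:-1]
        if not taken:
            return kernel_runs, trace


runs, trace = run_ctop_loop(lc=199, ec=4)
```

With LC = 199 and EC = 4, the model runs the kernel 203 times: 200 source iterations plus three extra passes to drain the four-stage pipeline.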
  • Page 200: Wtop And Wexit Execution Flow

    There are a few differences in the operation of the while loop branch compared to the counted loop branch. The while loop branch does not access LC — a branch predicate determines the behavior of this branch instead. During the kernel and epilog phases, the branch predicate is one and zero respectively.
  • Page 201 Value that is incremented (or decremented) once per source iteration by the same amount. Optimization of Loops in the Intel® Itanium® Architecture Register rotation, predication, and the software pipelined loop branches allow the generation of compact, yet highly parallel code. Speculation can further increase loop performance by removing dependency barriers that limit the throughput of software pipelined loops.
  • Page 202: Wtop Loop Trace

    Notice that the load for the second source iteration is executed before the compare and branch of the first source iteration. That is, the load (and the update of r5) is speculative. The loop condition is not computed until cycle X+2, but in order to maximize the use of resources, it is desirable to start the second source iteration at cycle X+1.
  • Page 203 Table 5-2. wtop Loop Trace (columns: Cycle, Port/Instructions, State before br.wtop). The executions of br.wtop in the first two cycles of the prolog do not correspond to any of the source iterations.
  • Page 204 Below is a possible pipeline with an II of 2, assuming a floating-point load latency of 9 cycles: stage 1: (p16) ldfs f4 = [r5],4 (p16) ldfs f9 = [r8],4;; // empty cycle stage 2-4: --- // empty stages stage 5: // empty cycle (p20) fcmp.ge.unc p1,p2 = f4,f9;;...
  • Page 205 5.5.3.1 Converting Multiple Exit Loops to Single Exit Loops The first is to transform the multiple exit loop into a single exit loop. In the source loop, execution of the add, the second compare and the second branch is guarded by the first branch.
  • Page 206 5.5.3.2 Pipelining with Explicit Multiple Exits The second approach is to combine the last three instructions in the loop into a br.cloop instruction and then pipeline the loop. The pipeline using this approach is shown below: stage 1: ld4.s r4 = [r5],4;; // II = 1 stage 4: ld4.s r9 = [r4];;...
  • Page 207 The following is a possible pipeline with an II of 2: stage 1: r4 = [r5],4 // Cycle 0 r7 = [r8],4;; // Cycle 0 // empty cycle stage 2: // empty cycle [r6] = r4,4 // Cycle 3 [r9] = r7,4;; // Cycle 3 In the source loop, one iteration is completed every three cycles.
  • Page 208 5.5.5.2 Conflicts in the ALAT Using an advanced load to remove a likely invariant load from a loop while advancing another load inside the loop results in poor performance if the latter load targets a rotating register. The advanced load that targets the rotating register will eventually invalidate the ALAT entry for the loop invariant load.
  • Page 209 5.5.6 Loop Unrolling Prior to Software Pipelining In some cases, higher performance can be achieved by unrolling the loop prior to software pipelining. Loops that are resource constrained can be improved by unrolling such that the limiting resource is more fully utilized. In the following example if we assume the target processor has only two memory units, the loop performance is bound by the number of memory units: r4 = [r5],4...
  • Page 210 predicate for the odd iteration is in predicate register X, the stage predicate for the even iteration is in predicate register X-1. The pseudo-code to implement this pipeline assuming an unknown trip count is shown below: r15 = r5,4 r18 = r8,4 lc = r2 // LC = loop count - 1 ec = 4...
  • Page 211 If the loop trip count is even, two epilog stages are executed and the kernel loop is exited at the br.ctop. If the trip count is odd, the first two epilog stages are executed and then the br.cexit branch is taken. Because the target of the br.cexit branch is the next sequential bundle (L4), a third epilog stage is executed before the kernel loop is exited at the br.ctop.
  • Page 212 This loop maintains five independent sums in registers f33-f37. The fma instruction in iteration X produces a result that is used by the fma instruction in iteration X+5. Iterations X through X+4 are independent, allowing an II of one to be achieved. The code for a pipelined version of the loop, assuming two memory ports and a nine-cycle latency for a floating-point load, is shown below: lc = 199...
  • Page 213 Note that, in the code above, the ld4 and the add instructions in stage 2 have been reordered. Register rotation has been used to eliminate the WAR register dependency from the add to the ld4. The first two stages are speculative. The code to implement the pipeline is shown below: r36 = [r5] ec = 2...
  • Page 214 under-utilized during the prolog and epilog phases. Part of the prolog and epilog could be peeled off and merged with the code preceding and following the loop. The following is a pipelined version of that counted loop with an explicit prolog and epilog: lc = 196 ec = 1 prolog:...
  • Page 215 5.5.9 Redundant Load Elimination in Loops Unrolling of a loop is sometimes necessary to remove copy operations created by loop optimizations. The following is an example of redundant load elimination. In the code below, each iteration loads two values, one of which has already been loaded by the previous source iteration: r8 = r5,4;;...
  • Page 216: Floating-Point Applications

    Floating-point Applications Overview The Itanium floating-point architecture is fully ANSI/IEEE-754 standard compliant and provides performance enhancing features such as the fused multiply accumulate instruction, the large floating-point register file (with static and rotating sections), the extended range register file data representation, the multiple independent floating-point status fields, and the high bandwidth memory access instructions that enable the creation of compact, high performance, floating-point application code.
  • Page 217 6.2.2 Execution Bandwidth When sufficient ILP exists and can be exploited, the performance limitation is the availability of the execution resources – or the execution bandwidth of the machine. Consider the dense matrix multiply kernel from the BLAS3 library. DO 1 i = 1, N DO 1 j = 1, P DO 1 k = 1, M C[i,j] = C[i,j] + A[i,k]*B[k,j]...
  • Page 218 Floating-point Features in the Intel® Itanium® Architecture This section highlights architectural features that reduce the impact of the performance limiters described in Section 6.2...
  • Page 219 Here, three registers are required to hold the operands (f5, f6) and the accumulator (f7). By recognizing the reuse of A[i,k] for different B[k,j] as j is varied, and the reuse of B[k,j] for different A[i,k] as i is varied, the computation can be restructured as follows: DO 1 i = 1, N, 2 DO 1 j = 1, P, 2 DO 1 k = 1, M...
  • Page 220 If we suppose the minimum floating-point load latency is 9 clocks, and 2 memory operations can be issued per clock, the above loop has to be unrolled by at least six if there is no register rotation. r8 = r7, 8 (p18) [r7] = f25, 16 // Cycle 17,26...
  • Page 221 inputs that might be single precision numbers. With the rounding performed at the 64th precision bit (instead of the 24th for single precision) a smaller error is accumulated with each multiply and add. Furthermore, with 17 bits of range (instead of 8 bits for single precision) large positive and negative products can be added to the accumulator without overflow or underflow.
  • Page 222: Software Divide/Square Root Sequence

    6.3.3 Software Divide/Square Root Sequence To perform division or square root operations on the Itanium architecture, a software-based sequence of operations is used. The sequence consists of obtaining an initial guess (using frcpa/frsqrta instruction) and then refining the guess by performing Newton-Raphson iterations until the error is sufficiently small so that it may not affect the rounding of the result.
  • Page 223: Computational Models

    For divide, the first instruction (frcpa) provides an approximation (good to 8 bits) of the reciprocal of f7 and sets the predicate (p6) to 1, if the ratio f6/f7 can be obtained using the prescribed Newton-Raphson iterations. If, however, the ratio f6/f7 is special (finite/0, finite/infinite, etc) the final result of f6/f7 is provided in f8 and the predicate (p6) is cleared.
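The refinement step can be sketched in Python. This is an illustration only: `approx_reciprocal` is a hypothetical stand-in for frcpa that keeps roughly 8 good bits, the iteration count is an assumption, and Python arithmetic rounds after every operation rather than using a fused multiply-add. The real sequences also include a final correction step for correct rounding, which this sketch omits.

```python
import math


def approx_reciprocal(b, bits=8):
    """Hypothetical stand-in for frcpa: 1/b truncated to about `bits` bits."""
    mantissa, exponent = math.frexp(1.0 / b)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def divide(a, b, iterations=3):
    """Approximate a / b by Newton-Raphson refinement of a reciprocal guess."""
    y = approx_reciprocal(b)
    for _ in range(iterations):
        e = 1.0 - b * y    # error of the current guess (an fnma on Itanium)
        y = y + y * e      # refined guess; good bits roughly double (an fma)
    return a * y


quotient = divide(1.0, 3.0)
```

Each iteration roughly squares the relative error, so an 8-bit seed reaches double-precision accuracy in about three steps.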
  • Page 224: Multiple Status Fields

    6.3.5 Multiple Status Fields The FPSR is divided into one main (architectural) status field and three additional identical status fields. These additional status fields could be used to performance advantage. First, divide and square-root sequences (described in Section 6.3.3) contain operations that might cause intermediate results to overflow/underflow or be inexact even if the final result may not.
  • Page 225: Other Features

    The availability of multiple additional status fields can allow a user to maintain multiple computational environments and to dynamically select among them on an operation by operation basis. One such use is in the implementation of interval arithmetic code where each primitive operation is required to be computed in two different rounding modes to determine the interval of the result.
  • Page 226 Since NaNs are unordered, comparison with NaNs (including LT) will return false. Hence if the above code is implemented as: f5 = [r5], 8;; L1: ldf f6 = [r5], 8 fmin f5 = f6, f5 br.cloop L1 ;; NaNs in the array (X) will be ignored. If the value in the array X (loaded in f6) is a NaN, the new minimum value (in f5) will remain unchanged, since the NaN will fail the .LT.
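The effect of operand order on NaN handling can be sketched in Python, whose float comparisons are also false when either operand is a NaN. The `fmin` helper below models the selection rule "first operand if it is less than the second, otherwise the second"; the function names are illustrative.

```python
def fmin(f2, f3):
    """Model of fmin f1 = f2, f3: f2 if f2 < f3, otherwise f3."""
    return f2 if f2 < f3 else f3


def array_min(xs):
    """Running minimum with the new element as the first operand."""
    m = xs[0]
    for x in xs[1:]:
        # A NaN in x fails the less-than test, so the running minimum is kept
        m = fmin(x, m)
    return m


smallest = array_min([3.0, float("nan"), 1.0, 2.0])
```

The sketch assumes the first array element is not a NaN.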
  • Page 227: Memory Access Control

    architecture provides instructions that allow moving floating-point fields between the integer and floating-point register files. Division of a floating-point number by 2.0 is accomplished as follows: getf.exp = f5 // Move S+Exp to int = r5, -1 // Sub 1 from Exp setf.exp = r5 // Move S+Exp to FP...
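The exponent-decrement trick can be illustrated in Python on IEEE double-precision values. This is a sketch under assumptions: it works on the 64-bit memory format rather than the Itanium register format, and it only handles normal numbers whose half is also normal. The function name is hypothetical.

```python
import struct


def halve_via_exponent(x):
    """Divide a normal double by 2.0 by subtracting 1 from its exponent field."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    if exponent <= 1 or exponent == 0x7FF:
        # Zero, denormal, smallest normal binade, infinity and NaN need other handling
        raise ValueError("only normal numbers with a normal half are handled")
    bits -= 1 << 52     # subtract 1 from the exponent, sign and significand untouched
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


half = halve_via_exponent(8.0)
```

No floating-point arithmetic is performed, so the result is exact for the inputs the sketch accepts.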
  • Page 228: Summary

    The inner loop consists of two loads (for A and B) and a multiply-add (to accumulate the product on C). The loop would run at the latency of the fma due to the recurrence on C. In order to break the recurrence on C, the loop is typically unrolled and multiple partial accumulators are used.
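Breaking the recurrence with partial accumulators can be sketched in Python. The function name and the choice of four accumulators are illustrative assumptions; on real hardware the count would be chosen to cover the fma latency.

```python
def dot_partial_sums(a, b, n_acc=4):
    """Dot product using n_acc independent accumulators."""
    acc = [0.0] * n_acc
    for k, (x, y) in enumerate(zip(a, b)):
        # Consecutive products go to different accumulators, so each
        # multiply-add depends only on a result from n_acc iterations earlier
        acc[k % n_acc] += x * y
    # The partial sums are combined once, after the loop
    return sum(acc)


total = dot_partial_sums([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0])
```

Note that splitting the sum reorders the floating-point additions, so the result can differ in the last bits from a single-accumulator loop.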
  • Page 229 support in the Itanium architecture beyond the software-pipelining support described in Chapter 5, “Software Pipelining and Loop Support” that help to overcome some of these performance limiters. Architectural support for speculation, rounding, and precision control are also described. Examples in the chapter include how to implement floating-point division and square root, common scientific computations such as reductions, use of features such as the fma instruction, and various Livermore kernels.
  • Page 231: Itanium Architecture

    Intel® Itanium® Architecture Software Developer’s Manual Volume 2: System Architecture Revision 2.3 May 2010 Document Number: 245318...
  • Page 233: Part I: Application Architecture Guide

    Part 1: Application Architecture Guide, 2:3; 1.1.2 Part 2: Optimization Guide for the Intel® Itanium® Architecture, 2:3; Overview of Volume 2: System Architecture.
  • Page 234 7.1.1 Data and Instruction Breakpoint Registers, 2:152...
  • Page 235 SAL Entrypoints, 2:282...
  • Page 236 Memory Fences, 2:510; Memory Ordering in the Intel® Itanium® Architecture, 2:510; 2.2.1...
  • Page 237 Software-only Deferral, 2:580...
  • Page 238 Floating-point System Software, 2:587; Floating-point Exceptions in the Intel® Itanium® Architecture, 2:587; 8.1.1...
  • Page 239: Index

    Intel® Itanium® System Environment, 2:14; System Register Model.
  • Page 240 Firmware Entrypoints Logical Model, 2:281...
  • Page 241 – Cache, 2:424...
  • Page 242 Interaction of Ordering and Accesses to Sequential Locations, 2:524; Why a Fence During Context Switches is Required in the Intel® Itanium® Architecture, 2:526; Spin Lock Code.
  • Page 243 Debug Instructions, 2:153...
  • Page 244 ISR Values on Interruption, 2:168; ISR.code Fields on Intel® Itanium® Traps, 2:170; Interruption Vectors Sorted Alphabetically.
  • Page 245 Hardware policies returned in cur_policy, 2:395...
  • Page 246 Architecture Provides a Relaxed Ordering Model, 2:512; Acquire and Release Semantics Order Intel® Itanium® Memory Operations, 2:513; Loads May Pass Stores to Different Locations.
  • Page 247 Interruption Handler Execution Environment (PSR and RSE.CFLE Settings), 2:540; Preserving Intel® Itanium® General and Floating-point Registers, 2:549; Register State Preservation at Different Points in the OS.
  • Page 249 Part I: System Architecture Guide
  • Page 251 IA-32 application interface. This volume also describes optimization techniques used to generate high performance software. 1.1.1 Part 1: Application Architecture Guide Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. ...
  • Page 252 1.2.1 Part 1: System Architecture Guide Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. ...
  • Page 253 Chapter 9, “IA-32 Interruption Vector Descriptions” lists IA-32 exceptions, interrupts and intercepts that can occur during IA-32 instruction set execution in the Itanium System Environment. Chapter 10, “Itanium® Architecture-based Operating System Interaction Model with IA-32 Applications” defines the operation of IA-32 instructions within the Itanium System Environment from the perspective of an Itanium architecture-based operating system.
  • Page 254 Instruction Set Reference This volume is a comprehensive reference to the Itanium instruction set, including instruction format/encoding. Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. Chapter 2, “Instruction Reference”...
  • Page 255 These resources include instructions and registers. Itanium Architecture – The new ISA with 64-bit instruction capabilities, new performance-enhancing features, and support for the IA-32 instruction set. IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
  • Page 256 • Intel® 64 and IA-32 Architectures Software Developer’s Manual – This set of manuals describes the Intel 32-bit architecture. They are available from the Intel Literature Department by calling 1-800-548-4725 and requesting Document Numbers 243190, 243191 and 243192. ...
  • Page 257 Revision history (columns: Date of Revision, Revision Number, Description). August 2005: Allow register fields in CR.LID register to be read-only and CR.LID checking on interruption messages by processors optional. See Vol 2, Part I, Ch 5 “Interruptions” and Section 11.2.2 PALE_RESET Exit State for details. Relaxed reserved and ignored fields checking in IA-32 application registers in Vol 1 Ch 6 and Vol 2, Part I, Ch 10.
  • Page 258 Revision history (continued). August 2002: Added Predicate Behavior of alloc Instruction Clarification (Section 4.1.2, Part I, Volume 1; Section 2.2, Part I, Volume 3). Added New fc.i Instruction (Section 4.4.6.1, and 4.4.6.2, Part I, Volume 1; Section 4.3.3, 4.4.1, 4.4.5, 4.4.6, 4.4.7, 5.5.2, and 7.1.2, Part I, Volume 2; Section 2.5, 2.5.1, 2.5.2, 2.5.3, and 4.5.2.1, Part II, Volume 2;...
  • Page 259 Revision history (continued). Volume 2: Class pr-writers-int clarification (Table A-5). PAL_MC_DRAIN clarification (Section 4.4.6.1). VHPT walk and forward progress change (Section 4.1.1.2). IA-32 IBR/DBR match clarification (Section 7.1.1). ISR figure changes (pp. 8-5, 8-26, 8-33 and 8-36). PAL_CACHE_FLUSH return argument change –...
  • Page 260 Revision history (continued). Volume 2: Clarifications regarding “reserved” fields in ITIR (Chapter 3). Instruction and Data translation must be enabled for executing IA-32 instructions (Chapters 3, 4 and 10). FCR/FDR mappings, and clarification to the value of PSR.ri after an RFI (Chapters 3 and 4).
  • Page 261: System Environment

    Figure labels (system environment boot flow): Reset (Intel® Itanium® Instructions); Platform Test & Initialization (Intel® Itanium® IA-32 Instructions); Itanium® architecture-based OS Boot (Intel® Itanium® Instructions & IA-32 Instructions)...
  • Page 262 • Chapter 7, “Debugging and Performance Monitoring” describes debug and performance monitoring hooks. • Chapter 8, “Interruption Vector Descriptions” describes interruption handler entry points.
  • Page 263 Chapter 9 describes IA-32 interruption handler entry points. • Chapter 10, “Itanium® Architecture-based Operating System Interaction Model with IA-32 Applications” describes how IA-32 applications interact with Itanium architecture-based operating systems.
  • Page 265 System State and Programming Model This chapter describes the architectural state visible only to an operating system and defines system state programming models. It covers the functional descriptions of all the system state registers, descriptions of individual fields in each register, and their serialization requirements.
  • Page 266 serialization requirements. This approach simplifies hardware and allows for more efficient software operations. For example, during a low level context switch where there is no immediate use of loaded system registers, these registers can be loaded without any serialization overhead. To ensure side effects are observed before a dependent instruction is fetched or executed, two serialization operations are provided: instruction serialization and data serialization.
  • Page 267 The control registers are different from the general registers and other registers. Most control registers require an explicit data serialization between the writing of a control register and the reading of that same control register. (See Table 3-3 on page 2:29 for serialization requirements for specific control registers.) The Data Serialize (srlz.d) instruction performs explicit data serialization.
  • Page 268 System State The architecture provides a rich set of system register resources for process control, interruptions handling, protection, debugging, and performance monitoring. This section gives an overview of these resources. 3.3.1 System State Overview Figure 3-1 shows the set of all defined privileged system register resources. Application state as defined in “Application Register State”...
  • Page 269 • Region Registers (RR) – Eight 64-bit region registers specify the identifiers and preferred page sizes for multiple virtual address spaces. Refer to “Region Registers (RR)” on page 2:58 for complete information. • Protection Key Registers (PKR) – At least sixteen 64-bit protection key registers contain protection keys and read, write, execute permissions for virtual memory protection domains.
  • Page 270 Figure 3-1. System Register Model (figure labels: Application Register Set with General Registers, Floating-point Registers, Branch Registers, Predicates, Application Registers, Instruction Pointer, Current Frame Marker, User Mask, Performance Monitor Data Registers, Processor Identifiers, Advanced Load Address Table)...
  • Page 271 3.3.2 Processor Status Register (PSR) The PSR maintains the current execution environment. The PSR is divided into four overlapping sections (See Figure 3-2): user mask bits (PSR{5:0}), system mask bits (PSR{23:0}), the lower half (PSR{31:0}), and the entire PSR (PSR{63:0}). PSR fields are defined in Table 3-2 along with serialization requirements for modification of each...
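The overlapping PSR sections can be expressed as bit masks. A minimal sketch in Python, using the bit ranges given above; the constant and function names are illustrative, not architectural names.

```python
USER_MASK = (1 << 6) - 1       # PSR{5:0}
SYSTEM_MASK = (1 << 24) - 1    # PSR{23:0}
LOWER_HALF = (1 << 32) - 1     # PSR{31:0}
FULL_PSR = (1 << 64) - 1       # PSR{63:0}


def psr_section(psr, mask):
    """Return the bits of a 64-bit PSR value that fall within one section."""
    return psr & mask


all_ones = FULL_PSR
user_bits = psr_section(all_ones, USER_MASK)
```

Each section is a subset of the next larger one, which is why the text calls them overlapping.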
  • Page 272 Lower (f2 .. f31) floating-point registers written – This bit is set to one when an Intel Itanium instruction completes that uses register f2..f31 as a target register. This bit is sticky and only cleared by an explicit write of the user mask. (Interruption state: unchanged; serialization required: data.)
  • Page 273 Upper (f32 .. f127) floating-point registers written – This bit is set to one when an Intel Itanium instruction completes that uses register f32..f127 as a target register. This bit is sticky and only cleared by an explicit write of the user mask. (Interruption state: unchanged; serialization required: data.)
  • Page 274 Table 3-2. Processor Status Register Fields (Continued). Columns: Field, Bits, Description, Interruption State, Serialization Required. Disabled Floating-point High register set – When 1, a data read or write access to f32 through f127 results in a Disabled Floating-Point Register fault. When 1, a Disabled FP Register fault is raised on the first IA-32 target instruction following a br.ia or rfi, regardless of whether f32-127 are referenced.
  • Page 275 PSR.cpl is unchanged by the jmpe and br.ia instructions. PSR.cpl cannot be updated by any IA-32 instructions. Instruction Set – When 0, Intel Itanium instructions are executing. When 1, IA-32 instructions are executing. Written by the rfi and br.ia instructions and the IA-32 jmpe instruction.
  • Page 276 Table 3-2. Processor Status Register Fields (Continued). Columns: Field, Bits, Description, Interruption State, Serialization Required. Single Step enable – When 1, a Single Step trap occurs following the successful execution of the first restart instruction in the current bundle. Instruction slots 0, 1, and 2 can be single stepped.
  • Page 277 a. User mask bits are implicitly serialized if accessed via user mask instructions: sum, rum, and move to User Mask. If modified with system mask instructions (rsm, ssm and move to PSR.l), software must explicitly serialize to ensure side effects are observed before dependent instructions. b.
  • Page 278 Table 3-3. Control Registers (Continued). Columns: Register, Name, Description, Serialization Required. Interruption Control Registers: CR16, IPSR, Interruption Processor Status Register, implied; CR17, ISR, Interruption Status Register, implied; CR18, reserved; CR19, IIP, Interruption Instruction Pointer, implied; CR20, IFA, Interruption Faulting Address, implied; CR21, ITIR, Interruption TLB Insertion Register, implied; CR22, IIPA...
  • Page 279 All unaligned Intel Itanium semaphore references generate an Unaligned Data Reference fault. All aligned Intel Itanium semaphore references made to memory that is neither write-back cacheable nor a NaTPage result in an Unsupported Data Reference fault.
  • Page 280 Table 3-5. Default Control Register Fields (Continued). Columns: Field, Description, Serialization Required. Defer Key Miss faults only – When 1, and a Key Miss fault is deferred, lower priority Access Bit, Access Rights or Debug faults may still be delivered. A Key Miss fault, deferred or not, precludes concurrent Key Permission faults. (Serialization required: data.)
  • Page 281 A sequence of reads of the ITC is guaranteed to return ever-increasing values (except for the case of the counter wrapping back to 0) corresponding to the program order of the reads. Applications can directly sample the ITC for time-based calculations. A 64-bit overflow condition can occur without notification.
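Because the counter can wrap to 0 without notification, interval calculations are usually done modulo 2^64. A minimal Python sketch, assuming the two samples are less than one full wrap apart; the function name is hypothetical and the samples are plain integers rather than real ITC reads.

```python
MASK64 = (1 << 64) - 1


def itc_elapsed(start, end):
    """Ticks between two ITC samples, tolerating a single wrap through 0."""
    # Subtraction modulo 2**64 gives the right distance even when end < start
    return (end - start) & MASK64


# The second sample was taken after the counter wrapped past 0
ticks = itc_elapsed(MASK64 - 4, 5)
```

In a language with fixed-width unsigned integers, the plain subtraction already wraps this way and the mask is unnecessary.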
  • Page 282 A sequence of reads of the RUC is guaranteed to return ever-increasing values (except for the case of the counter wrapping back to 0) corresponding to the program order of the reads. Applications can directly sample the RUC for active-running-time calculations.
  • Page 283 3.3.4.5 Interruption Vector Address (IVA – CR2) The IVA specifies the location of the interruption vector table in the virtual address space, or the physical address space if PSR.it is 0, see Figure 3-7. The size of the vector table is 32K bytes and is 32K byte aligned. The lower 15 bits of the IVA are ignored when written; reads return zeros in those bits.
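The alignment rule can be illustrated with a short Python sketch that models what a read of the IVA returns after an arbitrary 64-bit value is written. The function name is illustrative, and the model ignores unimplemented address bits.

```python
IVT_SIZE = 32 * 1024           # 32K-byte table, so 15 low-order offset bits
MASK64 = (1 << 64) - 1


def iva_after_write(value):
    """Value read back from the IVA: the lower 15 bits always read as zero."""
    return value & MASK64 & ~(IVT_SIZE - 1)


iva = iva_after_write(0x12345)
```

Clearing the low 15 bits is what guarantees that the table base is 32K-byte aligned.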
  • Page 284 3.3.5 Interruption Control Registers Registers CR16 - CR27 record information at the time of an interruption (including from the IA-32 instruction set) and are used by handlers to process the interruption. The interruption control registers can only be read or written while PSR.ic is 0; otherwise, an Illegal Operation fault is raised.
  • Page 285 (the processor was performing a data memory access to the IDT, GDT, LDT or TSS segments) or an IA-32 data memory access at a privilege level of zero. This bit is always 0 for interruptions taken while executing Intel Itanium instructions.
  • Page 286 Figure 3-10, all 64-bits of the IIP must be implemented regardless of the size of the physical and virtual address space supported by the processor model (see “Unimplemented Address Bits” on page 2:73). IIP also receives byte-aligned IA-32 instruction pointers. The IIP, IPSR and IFS are used to restore processor state on a Return From Interruption instruction (rfi).
  • Page 287 faulting instruction and IIP points to the first byte of the faulting instruction, or (2) for faults on the second page, IFA contains the bundle address of the second virtual page and IIP points to the first byte of the faulting IA-32 instruction. The IFA also specifies a translation’s virtual address when a translation entry is inserted into the instruction or data TLB.
  • Page 288 3.3.5.6 Interruption Instruction Previous Address (IIPA – CR22) For Itanium instructions, IIPA records the last successfully executed instruction bundle address. For IA-32 instructions, IIPA records the byte granular virtual instruction address zero extended to 64-bits of the faulting or trapping IA-32 instruction. In the case of a fault, IIPA does not report the address of the last successfully executed IA-32 instruction, but rather the address of the faulting IA-32 instruction.
  • Page 289 3.3.5.7 Interruption Function State (IFS – CR23) The IFS register is used to reload the current register stack frame (CFM) on a Return From Interruption (rfi). If the IFS is accessed while PSR.ic is 1, an Illegal Operation fault is raised. The IFS can only be accessed at privilege level 0; otherwise, a Privileged Operation fault is raised.
  • Page 290 3.3.5.10 Interruption Instruction Bundle Registers (IIB0-1 – CR26, 27) On an interruption and if PSR.ic is 1, the IIB registers receive the 16-byte instruction bundle corresponding to the interruption. The bundle reported in the IIB registers is the bundle exactly as it was fetched for execution of the instruction which raised the interruption.
  • Page 291 • An interruption selects bank 0, • rfi switches to the bank specified by IPSR.bn, or • bsw switches to the specified bank. On an interruption or bank switch, the processor ensures all prior register accesses (reads and writes) are performed to the prior register bank. Data values in banked registers are preserved across bank switches and both banks maintain NaT values when loaded from general registers.
  • Page 292 Processor Virtualization Processors in the Itanium Processor Family may optionally implement a mechanism to support processor virtualization. This includes an additional PSR.vm bit (see Section 3.3.2, “Processor Status Register (PSR)”), which, when 1, causes certain instructions to take a Virtualization fault (see Section 5.6, “Interruption Priorities”...
  • Page 293 Addressing and Protection This chapter defines operating system resources to translate 64-bit virtual addresses into physical addresses, 32-bit virtual addressing, virtual aliasing, physical addressing, memory ordering and properties of physical memory. Register state defined to support virtual memory management is defined in Chapter 3, while Chapter 5...
  • Page 294 Figure 4-1. Virtual Address Spaces (figure: a 64-bit virtual address space divided into 8 virtual regions of 2^61 bytes each, with pages of 4K to 256M bytes). By assigning sequential region identifiers, regions can be coalesced to produce larger 62-, 63- or 64-bit spaces. For example, an operating system could implement a 62-bit region for process private data, a 62-bit region for I/O, and a 63-bit region for globally shared data.
  • Page 295 Virtual addressing is enabled for instruction references when PSR.it is 1, for data references when PSR.dt is 1, and for register stack accesses when PSR.rt is 1. Figure 4-2. Conceptual Virtual Address Translation for References (figure: bits 63:61 of the virtual address form the Virtual Region Number (VRN), which selects the region register supplying the Region ID; the bits from 60 down form the Virtual Page Number (VPN) and the page offset).
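The field split described for Figure 4-2 can be sketched in a few lines. This is only an illustration, not processor behavior: the function name `split_virtual_address` and the default 4K page size (`page_size_log2=12`) are assumptions made for the example.

```python
def split_virtual_address(va: int, page_size_log2: int = 12) -> tuple[int, int, int]:
    """Split a 64-bit virtual address into (VRN, VPN, page offset).

    page_size_log2 is a hypothetical parameter: log2 of the page size in
    bytes (12 means a 4K-byte page).
    """
    if not 0 <= va < (1 << 64):
        raise ValueError("virtual address must fit in 64 bits")
    if not 0 < page_size_log2 < 61:
        raise ValueError("page size must leave room for a VPN below bit 61")
    vrn = va >> 61                                   # bits 63:61 select the region
    within_region = va & ((1 << 61) - 1)             # bits 60:0
    vpn = within_region >> page_size_log2            # virtual page number
    offset = within_region & ((1 << page_size_log2) - 1)
    return vrn, vpn, offset
```

For example, `split_virtual_address(0xE000000000001234)` yields region 7, page 1, and offset 0x234 under the assumed 4K page size.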
  • Page 296 The TLB is a local processor resource; installation of a translation or local processor purges do not affect other processors’ TLBs. Global TLB purges are provided to purge translations from all processors within a TLB coherence domain in a multiprocessor system.
  • Page 297 4.1.1.2 Translation Cache (TC) The Translation Cache (TC) is an implementation-specific structure defined to hold the large working set of dynamic translations for memory references (including IA-32). Please see the processor-specific documentation for further information on Itanium processor TC implementation details. The processor directly controls the replacement policy of all TC entries.
  • Page 298 inserted TC entry may be occasionally removed before this point, and software must be prepared to re-insert the TC entry on a subsequent fault. For example, eager or mandatory RSE activity, speculative VHPT walks, or other interruptions of the restart instruction may displace the software-inserted TC entry, but when software later re-inserts the same TC entry, the processor must eventually complete the restart instruction to ensure forward progress, even if that restart instruction takes other faults which must be handled before it can complete.
  • Page 299 4.1.1.4 Purge Behavior of TLB Inserts and Purges Translations contained in the translation caches (TC) and translation registers (TR) are maintained in a consistent state by ensuring that TLB insertions remove existing overlapping entries before new TR or TC entries are installed. Similarly, TLB purges that partially or fully overlap with existing translations may remove all overlapping entries.
  • Page 300 Note: Please refer to Table 4-1 for footnotes in Table 4-2. Table 4-1. Purge Behavior of TLB Inserts and Purges Case Insert? Purge? Machine Check? it[cr].[id] overlaps [ID]TC Must Must Must not it[cr].[id] overlaps [DI]TC Must Must not it[cr].[id] overlaps [ID]TR Must it[cr].[id] overlaps [DI]TR Must...
  • Page 301 Table 4-2. Purge behavior of VHPT Inserts VRN bits used for TLB searching on VHPT insert VRN bits not used for TLB searching on VHPT insert VRN Match No VRN Match Case Machine Machine Machine Insert? Purge? Insert? Purge? Insert? Purge? Check? Check?
  • Page 302 • The GR[r] value is checked when a TLB insert instruction is executed, and if reserved fields or reserved encodings are used, a Reserved Register/Field fault is raised on the TLB insert instruction. If GR[r]{0} is zero (not-present Translation Insertion Format), the rest of GR[r] is ignored. •...
  • Page 303 Accessed bit on a reference. GR[r]{6} Dirty Bit – When 0 and PSR.da is 0, Intel Itanium store or semaphore references to the page cause a Data Dirty Bit fault. When 0, IA-32 store or semaphore references to the page cause a Data Dirty Bit fault. The processor does not update the Dirty bit on a store or semaphore reference.
  • Page 304 Figure 4-6. Translation Insertion Format – Not Present 32 31 12 11 GR[r] ITIR rv/ci rv/ci RR[vrn] rv ig 4.1.1.6 Page Access Rights Page granular access controls use 4 levels of privilege. Privilege level 0 is the most privileged and has access to all privileged instructions; privilege level 3 is least privileged.
  • Page 305 Table 4-4. Page Access Rights (Continued) Privilege Level TLB.ar TLB.pl Description read, write, execute / read, write – – – – – – exec, promote / read, execute a. RSC.pl, for RSE fills and spills; PSR.cpl for all other accesses. b.
  • Page 306 Table 4-5. Architected Page Sizes Page Sizes 256k 256M Insertable Purgeable Page sizes are encoded in translation entries and region registers as a 6-bit encoded page size field. Each field specifies a mapping size of 2^N bytes, where N is the field value; thus a value of 12 represents a 4K-byte page.
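As a small worked example of the page size encoding, the sketch below converts between the 6-bit field value and a size in bytes. The helper names `page_size_bytes` and `encode_page_size` are hypothetical, and the code does not check whether a given size is actually insertable or purgeable on a particular processor.

```python
def page_size_bytes(ps: int) -> int:
    """Decode a 6-bit page size field: the mapping covers 2**ps bytes."""
    if not 0 <= ps < 64:
        raise ValueError("page size field is 6 bits wide")
    return 1 << ps


def encode_page_size(size_bytes: int) -> int:
    """Encode a power-of-two size in bytes as a page size field value."""
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("page size must be a power of two")
    return size_bytes.bit_length() - 1
```

With these helpers, a field value of 12 decodes to 4096 bytes, and a 256M-byte page encodes as 28.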
  • Page 307 Table 4-6. Region Register Fields (Continued) Field Bits Description Preferred page Size – Selects the virtual address bits used in hash functions for set-associative TLBs or the VHPT. Encoded as 2^ps bytes. The processor may make significant performance optimizations for the specified preferred page size for the region.
  • Page 308 Processor models have at least 16 protection key registers, and at least 18-bits of protection key. Some processor models may implement additional protection key registers and protection key bits. Unimplemented bits and registers are reserved. Key registers have at least as many implemented key bits as region registers have rid bits. Additional implemented bits must be contiguous and start at bit 18.
  • Page 309 Table 4-8. Translation Instructions (Continued) Instr. Serialization Mnemonic Description Operation Type Requirement Insert data DTC = GR[r ], IFA, ITIR data itc.d r translation cache Insert instruction ITR[GR[r ]] = GR[r ], IFA, ITIR inst itr.i itr[r ] = r translation register Insert data...
  • Page 310 Figure 4-9. Virtual Hash Page Table (VHPT) Virtual Address PTA.size VHPT Region Optional Collision Search Chain Registers Install Optional Operating System Page Tables Hashing Function PTA.base The processor does not manage the VHPT or perform any writes into the table. Software is responsible for insertion of entries into the VHPT (including replacement algorithms), dirty/access bit updates, invalidation due to purges and coherency in a multiprocessor system.
  • Page 311 fault is raised. If the region-based short-format VHPT entry contains no reserved bits or encodings, it is installed into the TLB, and the processor again attempts to translate the failed instruction or data reference. If the long-format VHPT entry’s tag specifies the correct region identifier and virtual address, and the entry contains no reserved bits or encodings, it is installed into the TLB, and the processor again attempts to translate the failed instruction or data reference.
  • Page 312 • Protection Key – specified by the accessed region identifier value (RR[VA{63:61}].rid). As a result, all implementations must ensure that the number of implemented key bits is greater than or equal to the number of implemented region identifier bits. If a translation is marked as not present, ignored fields are usable by software as noted in Figure 4-11.
  • Page 313 Figure 4-13. VHPT Not-present Long Format offset 32 31 2 1 0 For multiprocessor systems, atomic updates of long-format VHPT entries may be ensured by software as follows: • Before making multiple non-atomic updates to a VHPT entry in memory, software is required to set its ti bit to one.
  • Page 314 in which the VHPT is enabled, the operating system is required to maintain a per-region linear page table. As defined in Figure 4-14, the VHPT walker uses the virtual address, the region’s preferred page size, and the PTA.size field to compute a linear index into the short-format VHPT.
  • Page 315 the tag (ti bit) is zero for all valid tags. The hash index and tag together must uniquely identify a translation. The processor must ensure that the indices into the hashed table, the region’s preferred page size, and the tag specified in an indexed entry can be used in a reverse hash function to uniquely regenerate the region identifier and virtual address used to generate the index and tag.
  • Page 316 operating systems must ensure that the VHPT is aligned on the natural boundary of the structure; otherwise, processor operation is undefined. For example, a 64K-byte table must be aligned on a 64K-byte boundary. VHPT walker references to the VHPT are performed at privilege level 0, regardless of the state of PSR.cpl.
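The natural-alignment requirement on the VHPT can be checked with a simple mask test. The sketch below is an assumption-labeled illustration: `is_naturally_aligned` is a hypothetical helper, and it assumes the table size is a power of two.

```python
def is_naturally_aligned(base: int, size_bytes: int) -> bool:
    """Return True if base is a multiple of size_bytes (a power of two)."""
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("table size must be a power of two")
    return base & (size_bytes - 1) == 0
```

For the 64K-byte table in the example above, a base of 0x10000 passes the check while 0x18000 does not.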
  • Page 317 4.1.8 Translation Searching The general sequence of searching the TLB and VHPT is shown in Figure 4-16. On a failed TLB search, if the VHPT walker is disabled for the referenced region, an Alternate Instruction/Data TLB Miss fault is raised. If the VHPT walker is enabled for the referenced region, the VHPT is accessed to locate the missing translation.
  • Page 318 Figure 4-16. TLB/VHPT Search Virtual Address Virtual Address Unimplemented Data Address fault Implemented VA? Found Found Search TLB Search TLB Not Found Not Found Data Nested TLB fault Data PSR.ic Inst VHPT Walker Enabled Alternate Instruction TLB Miss fault VHPT Walker Enabled 1/In-flight Alternate Data TLB Miss fault...
  • Page 319 Table 4-10. TLB and VHPT Search Faults (Continued) Fault Description Instruction/Data TLB Miss Raised when the VHPT walker is enabled, but the processor: • Cannot locate the required VHPT entry, or • The processor aborts the VHPT search for implementation-specific reasons, or •...
  • Page 320 In the sign-extension model, software ensures that the upper 32 bits of a virtual address are always equal to bit 31. Address computations use the add, shladd, and sxt instructions. This model splits the 32-bit address space into two halves that each occupy 2^31 bytes, one in virtual region 0 and one in virtual region 7, within the 64-bit virtual address space.
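The effect of the sign-extension model on region selection can be illustrated as follows. This sketch only models the arithmetic; the function names `sign_extend_32` and `region_of` are hypothetical, and on real hardware the extension would be done by instructions such as sxt.

```python
def sign_extend_32(addr32: int) -> int:
    """Copy bit 31 of a 32-bit address into bits 63:32 of a 64-bit address."""
    if not 0 <= addr32 < (1 << 32):
        raise ValueError("address must fit in 32 bits")
    if addr32 & 0x8000_0000:
        return addr32 | 0xFFFF_FFFF_0000_0000
    return addr32


def region_of(va: int) -> int:
    """Virtual region number: bits 63:61 of a 64-bit virtual address."""
    return va >> 61
```

Addresses with bit 31 clear stay in region 0, while addresses with bit 31 set land in region 7, which matches the two halves described above.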
  • Page 321 Physical Addressing Objects in memory and I/O occupy a common 63-bit physical address space that is accessed using byte addresses. Accesses to physical memory and I/O may be performed via virtual addresses mapped to the 63-bit physical address space or by direct physical addressing.
  • Page 322 significant implemented physical address bit. In a processor that implements all physical address bits, IMPL_PA_MSB is 62. Please see the processor-specific documentation for further information on the number of physical address bits implemented on the Itanium processor. If unimplemented physical address bits are set by software, an Unimplemented Data Address fault is raised during the TLB insert instructions (itc, itr).
  • Page 323 4.3.3 Instruction Behavior with Unimplemented Addresses The use of an unimplemented address affects instruction execution as described in the bullet list below. If instruction address translation is enabled, an “unimplemented address” refers to an unimplemented virtual address. If instruction address translation is disabled, an “unimplemented address”...
  • Page 324 Table 4-11. Virtual Addressing Memory Attribute Encodings Coherent with Attribute Mnemonic ma Cacheability Write Policy Speculation Respect to Write Back Cacheable Write back WB, WBL Non-sequential & Write speculative Coalescing Not MP coherent Coalescing Uncacheable Uncacheable Sequential & Non-coalescing UC, UCE Uncacheable non-speculative Exported...
  • Page 325 Table 4-12. Physical Addressing Memory Attribute Encodings Coherent with Bit{63} Mnemonic Cacheability Write Policy Speculation respect to Cacheable Write Back Non-sequential & WBL, WB limited speculation Uncached Non-coalescing Sequential & UC, UCE non-speculative a. Coherency here refers to multiprocessor coherence on normal, side-effect free memory. “Speculation Attributes”...
  • Page 326 maintain coherency between processor local instruction and data caches for IA-32 code. Instruction caches are also not required to be coherent with multiprocessor Itanium instruction set originated memory references. Instruction caches are required to be coherent with multiprocessor IA-32 instruction set originated memory references. The processor must ensure that transactions from other I/O agents (such as DMA) are physically coherent with the instruction and data cache.
  • Page 327 become flushed and made visible prior to itself becoming visible. Even though IA-32 stores and loads are ordered, the write-coalesced data is not flushed unless the IA-32 stores or loads are to uncached memory types. The Flush Cache (fc, fc.i) instruction flushes all write-coalesced data whose address is within at least 32 bytes of the 32-byte aligned address specified by the Flush Cache (fc, fc.i) instruction, forcing the data to become visible.
  • Page 328 Prefetches are enabled if a speculative translation exists. Prefetches are asynchronous data and instruction memory accesses that appear logically to initiate and finish between some pair of instructions. This access may not be visible to subsequent flush cache (fc, fc.i) and/or TLB purge instructions. This behavior is implementation-dependent.
  • Page 329 a. Speculative or speculative advanced loads that cause deferred exceptions result in failed speculation. The processor aborts the reference. If the target of the load is a GR, the processor sets the register’s NaT bit to one. If the target of the load is an FR, the processor sets the target FR to NaTVal. The processor performs all other side-effects (such as post-increment).
  • Page 330 • It takes an External interrupt, but if it had not taken an External interrupt, it would have met one of the above qualifications (execute without fault, take an Unaligned Data Reference fault, or take a Data Debug fault) Data-speculative loads are treated the same as normal loads, and if an in-order execution of the program requires the execution of a data speculative load, it constitutes a verified reference.
  • Page 331 Table 4-15. Ordering Semantics and Instructions (columns: Ordering Semantics; Description; Orderable Intel® Itanium® Instructions). Unordered: instructions may become visible in any order. Orderable instructions: ld, ld.s, ld.a, ld.sa, ld.fill, ldf, ldf.s, ldf.sa, ldf.fill, ldfp, ldfp.s, ldfp.sa,...
  • Page 332 Inter-Processor Interrupt Messages (8-byte stores to a Processor Interrupt Block address, through a UC memory attribute) are exceptions to the sequential semantics. IPIs are not ordered with respect to other IPIs directed at the same processor. Further, fence operations do not enforce ordering between two IPIs. See Section 5.8.4.2, “Interrupt and IPI Ordering”...
  • Page 333 accesses of different sizes but with overlapping memory references appear to complete non-atomically. To ensure that a memory write is globally observed prior to a memory read, software must place an explicit fence operation between the two operations. Aligned st.rel and semaphore operations from multiple processors to cacheable write-back memory become visible to all observers in a single total order (i.e., in a particular interleaving;...
  • Page 334 ld x = [b] cmp.eq p1 = x, ‘new’ (p1) br target target: ld y = [a] if the second processor observes the store to [b], it will also observe the store to [a]. The flush cache (fc, fc.i) instruction follows data dependency ordering. fc and fc.i are ordered only with respect to previous and subsequent load, store, or semaphore instructions to the same line, regardless of the specified memory attribute.
  • Page 335 Page Consumption fault. cmpxchg and xchg accesses to pages with other memory attributes cause an Unsupported Data Reference fault. • fetchadd: The fetchadd instruction can be executed successfully only if the access is to a cacheable page with write-back write policy or to a UCE page. fetchadd accesses to NaTPages cause a Data NaT Page Consumption fault.
  • Page 336 undefined behavior; when changing an existing page from speculative to non-speculative (or vice-versa), software should ensure that any ALAT entries corresponding to that page are invalidated. Limited speculation pages behave like non-speculative pages with respect to speculative advanced loads, and behave like speculative pages with respect to all other advanced and/or check loads.
  • Page 337 3. mf ;; // Ensure visibility of ptc.ga to local data stream srlz.i ;; // Ensure visibility of ptc.ga to local instruction stream After step 3, no processor in the coherence domain will initiate new memory references or prefetches to the old translation. Note, however, that memory references or prefetches initiated to the old translation prior to step 2 may still be in progress after step 3.
  • Page 338 9. Call PAL_MC_DRAIN 10. Using the IPI mechanism defined in “Inter-processor Interrupt Messages” on page 2:128 to reach all processors in the coherence domain, perform step 9 above on all processors in the coherence domain, and wait for all PAL_MC_DRAIN calls to complete on all processors in the coherence domain before continuing.
  • Page 339 // Ensure cache flushes are also seen by processors' instruction fetch sync.i ;; After step 3, all flush cache instructions initiated in step 3 are visible to all processors in the coherence domain, i.e., no processor in the coherence domain will respond with a cache line hit on a memory reference to an address belonging to page “X.”...
  • Page 340 3. Execute: mf ;; srlz.i ;; (This ensures visibility of ptr.d, ptr.i, or ptc.ga to both the data and instruction streams, so that no new prefetches will be made to the old translations.) 4. Call PAL_PREFETCH_VISIBILITY with the input argument trans_type equal to one to indicate that the transition is for all memory attributes.
  • Page 341 8. If PAL_CACHE_FLUSH is used to flush caches, it must also be called on all processors in the coherency domain. In any case, PAL_MC_DRAIN must be called on all processors. Using the IPI mechanism defined in Section 5.8.4.1, “Inter-processor Interrupt Messages” on page 2:128 to reach all processors in the coherence domain, perform step 6.a, if necessary, and step 7 above in that order on all processors in the coherence domain, and wait for all PAL_MC_DRAIN...
  • Page 342 boundaries respectively to avoid generation of an Unaligned Data Reference fault. When PSR.ac is 1, any IA-32 data memory reference that is not aligned on a boundary the size of the operand results in an IA_32_Exception(AlignmentCheck) fault. Note: 10-byte and floating-point load double pair datum alignment is 16-bytes. The alignment of long format 32-byte VHPT references is always 32-bytes.
  • Page 343 Interruptions Interruptions are events that occur during instruction processing, causing the flow control to be passed to an interruption handling routine. In the process, certain processor state is saved automatically by the processor. Upon completion of interruption processing, a return from interruption (rfi) is executed which restores the saved processor state.
  • Page 344 Non-Maskable Interrupts are used to request critical operating system services. NMIs are assigned external interrupt vector number 2. • External Controller Interrupts (ExtINT) External Controller Interrupts are used to service Intel 8259A-compatible external interrupt controllers. ExtINTs are assigned locally within the processor to external interrupt vector number 0.
  • Page 345 and all previous instructions are completed. Subsequent instructions have no effect on machine state. Traps are IVA-based interruptions. Figure 5-1 summarizes the above classification. Figure 5-1. Interruption Classification Aborts Interrupts Faults Traps INIT RESET (NMI, ExtINT, ...) PAL-based Interruptions IVA-based Interruptions Unless otherwise indicated, the term “interruptions”...
  • Page 346 Upon an interruption, asynchronous events such as external interrupt delivery are disabled automatically by hardware to allow software to either handle the interruption immediately or to safely unload the interruption resources and save them to memory. Software will either deal with the cause of the interruption and rfi back to the point of the interruption, or it will establish a new environment and spill processor state to memory to prepare for a call to higher-level code.
  • Page 347 4. For Itanium architecture-based code, the processor checks for a valid register stack frame. • If incomplete and RSE Current Frame Load Enable (RSE.CFLE) is set, then perform a mandatory RSE load and start again at step one. The mandatory load operation may fault.
  • Page 348 breakpoint faults. The IA-32 effective instruction address (EIP) is converted into a 64-bit virtual linear address IP and IA-32 defined code segmentation and code fetch faults are checked and may result in a fault. 7. When PSR.is is 0, the bundle is fetched using the IP. When PSR.is is 1, an IA-32 instruction is fetched using IP.
  • Page 349 • If more than one trap is triggered (such as Unimplemented Instruction Address trap, Lower-Privilege Transfer trap, and Single Step trap) the highest priority trap is taken. The ISR.code contains a bit vector with one bit set for each trap triggered.
  • Page 350 branch-related traps, IIP is written with the target of the branch; for all other traps, IIP is written with the address of the bundle or IA-32 instruction containing the next sequential instruction. • IIPA receives the IP of the last successfully executed Itanium instruction. For IA-32 instructions, IIPA receives the IP of the faulting or trapping IA-32 instruction.
  • Page 351 registers, overlapping GR16 to GR31. Which set of physical registers are accessed through GR16 to GR31 is determined by the PSR.bn bit. On an interruption this bit is forced to zero allowing access to the alternate set of 16 registers which can be used as scratch space or to hold predetermined values.
  • Page 352 These non-access Itanium instructions can cause interruptions: fc, fc.i, lfetch.fault, probe, probe.fault, tpa, and tak. (tak can cause interruptions only for non-TLB reasons.) ISR.code will be set to indicate which non-access instruction caused the interruption. See Table 5-1 for ISR field settings for non-access instructions. Table 5-1.
  • Page 353 5.5.5 Deferral of Speculative Load Faults Speculative and speculative advanced loads can defer fault handling by suppressing the speculative memory reference, and by setting the deferred exception indicator (NaT bit or NaTVal) of the load target register. Other effects of the instruction (such as post increment) are performed.
  • Page 354 Aborts, external interrupts, RSE or instruction-fetch-related faults that happen to occur on a speculative load are always raised (since they are not related to the speculative load instruction). Illegal Operation faults and Disabled Floating-point Register faults that occur on a speculative load are always raised. Processing of exception conditions for speculative and speculative advanced loads is done in three stages: qualification, deferral and prioritization.
  • Page 355 Deferral is controlled by PSR.ed, PSR.it, PSR.ic, the speculative deferral control bits in the DCR, the exception deferral bit of the code page’s instruction TLB entry (ITLB.ed), and the memory attribute of the referenced data page. The speculative load and speculative advanced load exception deferral conditions are as follows: •...
  • Page 356 exception condition which is neither precluded nor deferred. Prioritization of non-deferred speculative load faults follows the same interruption priorities as non-speculative instruction faults (Table 5-6 on page 2:109). However, deferred speculative load faults do not take part in the prioritization. As a result, depending on DCR settings, a lower priority fault may be taken, even if a higher priority exception condition exists, but is deferred.
  • Page 357 Interruption Name Vector Name Class Aborts PALE_RESET vector Machine Reset (RESET) IA-32, PALE_CHECK vector Machine Check (MCA) Intel Interrupts PALE_INIT vector Itanium Initialization Interrupt (INIT) PALE_PMI vector Platform Management Interrupt (PMI) External Interrupt vector External Interrupt (INT) Virtual External Interrupt vector...
  • Page 358 Vector Name Class Disabled FP-Register vector Disabled Floating-point Register fault IA-32, General Exception vector Disabled Instruction Set Transition fault Intel Itanium IA-32 Exception vector (DNA) IA-32 Device Not Available fault IA-32 IA-32 Exception vector (FPError) IA-32 FP Error fault IA-32,...
  • Page 359 Table 5-6. Interruption Priorities (Continued) IA-32 Type Instr. Set Interruption Name Vector Name Class IA-32 Intercept vector (SystemFlag) IA-32 System Flag Intercept trap IA-32 Intercept vector (Gate) IA-32 Gate Intercept trap IA-32 Exception vector (Overflow) IA-32 INTO trap IA-32 Exception vector (Break) IA-32 Breakpoint (INT 3) trap IA-32 IA-32 Interrupt vector (Vector#)
  • Page 360 greater than the page boundary, any Instruction TLB faults on the second page have higher priority than the IA-32 Code Fetch fault. Class B – Faults from decoding an instruction. Priority of IA-32 Instruction Length, IA-32 Invalid Opcode, IA-32 Instruction Intercept, Disabled Floating Point Register, Disabled Instruction Set Transition, and Device Not Available faults is model specific.
  • Page 361 IVA-based Interruption Vectors Table 5-7 contains the processor’s interruption vector table (IVT). The base of the IVT is held in the IVA control register. The size of the IVT is 32KB. The first 20 vectors are designed to provide more code space by allowing 64 bundles per vector (16 bytes per bundle) for performance-critical interruption handlers.
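The vector layout can be turned into a small offset calculation. In the sketch below, the 32KB table size, the 20 large vectors, and the 64 bundles of 16 bytes each come from the text above; the 0x100-byte size of the later vectors is an assumption inferred from the offsets listed in Table 5-7 (0x5900, 0x5a00, ...), and `ivt_offset` with its zero-based `index` argument is a hypothetical helper.

```python
BUNDLE_BYTES = 16
IVT_BYTES = 32 * 1024
LARGE_VECTORS = 20
LARGE_VECTOR_BYTES = 64 * BUNDLE_BYTES   # 64 bundles per vector = 0x400 bytes
SMALL_VECTOR_BYTES = 0x100               # assumption inferred from Table 5-7 offsets


def ivt_offset(index: int) -> int:
    """Offset from the IVT base (held in IVA) of the vector at a zero-based index."""
    if index < 0:
        raise ValueError("vector index must be non-negative")
    if index < LARGE_VECTORS:
        offset = index * LARGE_VECTOR_BYTES
    else:
        offset = (LARGE_VECTORS * LARGE_VECTOR_BYTES
                  + (index - LARGE_VECTORS) * SMALL_VECTOR_BYTES)
    if offset >= IVT_BYTES:
        raise ValueError("vector index lies beyond the 32KB table")
    return offset
```

Under these assumptions the 20 large vectors fill offsets 0x0000 through 0x4fff, and the remaining 0x3000 bytes hold 48 smaller vectors, so index 29 lands at 0x5900.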
  • Page 362 Table 5-7. Interruption Vector Table (IVT) (Continued) Offset Vector Name Interruption(s) Page 0x5900 Debug vector 18, 31, 62 2:200 0x5a00 Unaligned Reference vector 2:201 0x5b00 Unsupported Data Reference vector 2:202 0x5c00 Floating-point Fault vector 2:203 0x5d00 Floating-point Trap vector 2:204 0x5e00 Lower-Privilege Transfer Trap vector 72, 74...
  • Page 363 (LINT, INIT, PMI), and are always directed to the local processor. The LINT pins can be connected directly to an Intel 8259A-compatible external interrupt controller. The LINT pins are programmable to be either edge-sensitive or level-sensitive, and for the kind of interrupt that gets generated. If programmed to generate external interrupts, the vector number is a programmed constant per LINT pin.
  • Page 364 • Internal processor interrupts – such as interval timer, performance monitoring, and corrected machine checks. These are always directed to the local processor. A unique vector number can be programmed for each source. • Other processors – A processor can interrupt any individual processor, including...
  • Page 365 • The priority of interrupts is defined in Table 5-8. Entry A has higher priority than entry B if entry A appears at a higher location in the table than entry B. Interrupt priority is used to select interrupts that require urgent service over less urgent interrupt requests.
  • Page 366 0 - 255. Vector numbers 1 and 3 through 14 are reserved for future use. Vector number 0 (ExtINT) is used to service Intel 8259A-compatible external interrupt controllers. Vector number 2 is used for the Non-Maskable Interrupt (NMI). The remaining 240 external interrupt vector numbers (16 through 255) are available for general operating system use.
  • Page 367 Table 5-8. Interrupt Priorities, Enabling, and Masking Interrupt Priority Vector Interrupt Unmasked Priority Interrupt Delivery Class Number Condition Enabled Highest INIT if PSR.mc is 0 Always 0..3 if PSR.ic is 1 Always 2 (NMI) if PSR.i is 1 Interrupt is higher priority than all in-service external interrupts 0 (ExtINT) TPR.mmi is 0, and interrupt is...
  • Page 368 The processor provides nested interrupt priority support for external interrupt vectors 0, 2, and 16 through 255 by: • Automatically masking external interrupts of equal or lower priority than the highest priority external interrupt currently in-service. This raises the in-service external interrupt masking level when each external interrupt begins service by an IVR read.
  • Page 369 ssm PSR.i srlz.d // external interrupts may be sampled anywhere here rsm PSR.i The stop following the srlz.d instruction in the above code sequence is required to force the Reset System Mask (rsm) instruction into a subsequent instruction group. The stop guarantees that the srlz.d will open the external interrupt window for at least one cycle before the rsm instruction closes it again.
  • Page 370 Table 5-9. External Interrupt Control Registers Register Name Description CR64 Local ID CR65 External Interrupt Vector Register (read only) CR66 Task Priority Register CR67 End Of External Interrupt CR68 IRR0 External Interrupt Request Register 0 (read only) CR69 IRR1 External Interrupt Request Register 1 (read only) CR70 IRR2 External Interrupt Request Register 2 (read only)
  • Page 371 IVR is a read-only register; writes to IVR result in an Illegal Operation fault. IVR reads do not issue an external INTA cycle. If the interrupt vector must be acquired from an Intel 8259A-compatible external interrupt controller, software should perform a load from the INTA byte. See “Interrupt Acknowledge (INTA) Cycle”...
  • Page 372 PSR.up is set to 1, potentially enabling performance monitor interrupts, and the new priority levels need to be in place before this enabling, a data serialization must be performed. (Note that there's no dependence between writing TPR and then changing the PSR for any other bits in the PSR than these.) A data serialization operation must be performed after TPR is written and before IVR is read to ensure that the reported IVR vector is correctly masked.
  • Page 373 5.8.3.5 External Interrupt Request Registers (IRR0-3 – CR68,69,70,71) Four 64-bit read-only External Interrupt Request Registers (IRR0-3, see Figure 5-10) provide the capability for software to determine the set of pending asynchronous external interrupts. IRR0 contains vectors <63:0> where vector 0 is in bit position 0, IRR1 contains vectors <127:64>, IRR2 contains vectors <191:128>, and IRR3 contains vectors <255:192>.
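The mapping from a vector number to its IRR register and bit position is a simple divide by 64. The sketch below models it on plain integers; `irr_position` and `is_pending` are hypothetical helpers, and the `irr` list stands in for values software would have read from IRR0 through IRR3.

```python
def irr_position(vector: int) -> tuple[int, int]:
    """Return (IRR register index, bit position) for an external interrupt vector."""
    if not 0 <= vector <= 255:
        raise ValueError("external interrupt vectors range from 0 to 255")
    return divmod(vector, 64)


def is_pending(irr: list[int], vector: int) -> bool:
    """Test the pending bit for a vector, given the four 64-bit IRR values."""
    if len(irr) != 4:
        raise ValueError("expected the values of IRR0 through IRR3")
    register, bit = irr_position(vector)
    return bool((irr[register] >> bit) & 1)
```

For instance, vector 130 maps to bit 2 of IRR2, so `is_pending([0, 0, 1 << 2, 0], 130)` evaluates to true.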
  • Page 374 5.8.3.7 Performance Monitoring Vector (PMV – CR73) PMV specifies the external interrupt vector number for Performance Monitoring overflow interrupts. To ensure that subsequent performance monitor interrupts reflect the new state of PMV by a given point in program execution, software must perform a data serialization operation after a PMV write and prior to that point.
  • Page 375 INIT – pend an Initialization Interrupt for system firmware. The vector field is ignored. reserved ExtINT – pend an Intel 8259A-compatible interrupt. This interrupt is delivered at external interrupt vector number 0. For details on servicing ExtINT external interrupts see “Interrupt Acknowledge (INTA) Cycle”...
  • Page 376 Figure 5-15. Processor Interrupt Block Memory Layout +0x1FFFFF Undefined ..+0x1E0008 Undefined INTA +0x1E0000 Undefined +0x100000 ....... +0x000020 +0x000018 +0x000010 +0x000008 +0x000000 ib_base The Inter-Processor Interrupt region occupies the lower half of the Processor Interrupt Block; by default its physical address range is 0x0000 0000 FEE0 0000 through 0x0000 0000 FEEF FFFF.
  • Page 377 INIT – pend an Initialization Interrupt for platform firmware on the processor listed in the destination. The vector field is ignored. Reserved ExtINT – pend an Intel 8259A-compatible interrupt. This interrupt is delivered at external interrupt vector number 0. For details on servicing ExtINT external interrupts see “Interrupt Acknowledge (INTA) Cycle”...
  • Page 378 The INTA Byte is located within the upper half of the Processor Interrupt Block, at offset 0x1E0000 from the base. A single byte load from the INTA address causes the processor to emit the INTA cycle on the processor system bus. An Intel 8259A-compatible external interrupt controller must respond with the actual interrupt vector number as the data to be loaded.
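The address arithmetic for the Processor Interrupt Block can be illustrated as follows. This is a sketch under stated assumptions: the default base is inferred from the default Inter-Processor Interrupt range quoted above, the block size is inferred from the +0x1FFFFF upper bound in Figure 5-15, and the constant and function names are hypothetical.

```python
# Default base inferred from the default IPI range 0xFEE00000..0xFEEFFFFF.
DEFAULT_IB_BASE = 0x0000_0000_FEE0_0000
# Figure 5-15 shows offsets +0x000000 through +0x1FFFFF.
PIB_SIZE = 0x20_0000
# The INTA Byte sits in the upper half, at offset 0x1E0000 from the base.
INTA_OFFSET = 0x1E_0000


def ipi_region(ib_base=DEFAULT_IB_BASE):
    """Return (first, last) physical address of the lower-half IPI region."""
    return ib_base, ib_base + PIB_SIZE // 2 - 1


def inta_address(ib_base=DEFAULT_IB_BASE):
    """Return the physical address of the INTA Byte."""
    return ib_base + INTA_OFFSET
```

With the default base, the INTA Byte falls at 0xFEFE0000, inside the upper half of the block.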
  • Page 379 processor does not interpret any data stored to the XTP Byte address and all data bits are passed to the external system unmodified. Any memory operation to the XTP address other than a single byte store is undefined. XTPR is written by operating system code to notify the system that the processor’s current task priority has been changed.
  • Page 380 2:132 Volume 2, Part 1: Interruptions...
  • Page 381 Register Stack Engine The register stack engine (RSE) moves registers between the register stack and the backing store in memory without explicit program intervention. The RSE operates concurrently with the processor and can take advantage of unused memory bandwidth to dynamically issue register spill and fill operations. In this manner, the latency of register spill/fill operations can be overlapped with useful program work.
  • Page 382 a stacked register from the backing store it also fills the register’s NaT bit. Whenever bits 8:3 of the RSE backing store load pointer are all ones, the RSE reloads a NaT collection from the backing store. Bit 63 of the NaT collection is ignored when read from the backing store.
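The "bits 8:3 are all ones" test can be expressed directly. The following sketch assumes a backing store pointer held as a plain integer; the helper names are hypothetical.

```python
def is_nat_collection_slot(pointer):
    """True when bits 8:3 of a backing store pointer are all ones.

    Per the text, this is the condition under which the RSE reloads a
    NaT collection from the backing store.
    """
    return ((pointer >> 3) & 0x3F) == 0x3F


def usable_nat_bits(collection):
    """Drop bit 63 of a NaT collection, which is ignored when read."""
    return collection & ((1 << 63) - 1)
```

Because six pointer bits select the slot, one NaT collection slot appears in every group of 64 consecutive 8-byte slots.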
  • Page 383 The RSE operates concurrently and asynchronously with respect to instruction execution by taking advantage of unused memory bandwidth to dynamically perform register spill and fill operations. The algorithm employed by the RSE to determine whether and when to spill/fill is implementation dependent. Software cannot depend on the spill/fill algorithm.
  • Page 384 Table 6-1. RSE Internal State (Continued). Columns: Name; Description; Corresponds To. RSE.ndirty: number of dirty registers on the register stack. RSE.ndirty_words: number of dirty words on the register stack plus the corresponding number of NaT collection registers; corresponds to AR[BSP] - AR[BSPSTORE]. Register Stack Partitions: The processor’s physical register file provides at least 96 stacked registers.
  • Page 385 Figure 6-3. Four Partitions of the Register Stack Invalid Physical Stacked Registers RSE.LoadReg RSE.StoreReg RSE.BOF CFM.sof Clean Dirty Current RSE Store return, rfi call, cover return, rfi, alloc RSE Load Higher Addresses RSE.BspLoad AR[BSPSTORE] AR[BSP] Backing Store The boundaries between the four register stack partitions are defined by the current frame marker (CFM) and three physical register numbers: a load, store and bottom-of-frame register number.
  • Page 386 place at lower addresses, defined relative to BSP by the sizes of the clean and dirty partitions. Although the stack is conceptually infinite in both directions, the effective base of the stack is expected to be the first memory location of the first page allocated to the backing store.
  • Page 387 RSE Control The RSE can be controlled at all privilege levels by means of three instructions (cover, flushrs, and loadrs) and by accessing four application registers (mov to/from RSC, BSP, BSPSTORE and RNAT). This section first presents each of the RSE application registers, and then discusses the three RSE control instructions.
  • Page 388 Protection is also checked based on the current entries in the data TLB. The RSE always remains coherent with respect to the data TLB. If a translation that is being used by the RSE is changed or purged, the RSE will immediately begin using the new translation or suffer a TLB miss.
  • Page 389 6.5.3 Backing Store Pointer Application Registers The RSE defines two Backing Store Pointer application registers: BSPSTORE and BSP. Since the RSE backing store pointers are always 8-byte aligned, bits {2:0} of the backing store pointers always read as zero. When writing the BSPSTORE application register, bits {2:0} in the presented address are ignored.
  • Page 390 Table 6-4. Backing Store Pointer Application Registers. Columns: Affected State; Read BSP (mov r=AR[BSP]); Read BSPSTORE (mov r=AR[BSPSTORE]); Write BSPSTORE (mov AR[BSPSTORE]=r). GR[r]: AR[BSP]; AR[BSPSTORE]; (blank). AR[BSP]{63:3}: Unchanged; Unchanged; (GR[r]{63:3} + RSE.ndirty) + ((GR[r]{8:3} + RSE.ndirty)/63). AR[BSPSTORE]{63:3}: Unchanged; Unchanged; GR[r...
  • Page 391 Table 6-5. RSE Control Instructions. Columns: Affected State; cover; flushrs; loadrs. AR[BSP]{63:3}: AR[BSP]{63:3} + CFM.sof + (AR[BSP]{8:3} + CFM.sof)/63; Unchanged; Unchanged. AR[BSPSTORE]{63:3}: Unchanged; AR[BSP]{63:3}; AR[BSP]{63:3} - AR[RSC].loadrs{13:3}. RSE.BspLoad{63:3}: Unchanged; Model specific; AR[BSP]{63:3} - AR[RSC].loadrs{13:3}. AR[RNAT]: Unchanged; Updated; UNDEFINED. RSE.RNATBitIndex: Unchanged; AR[BSPSTORE]{8:3}; AR[BSPSTORE]{8:3}. CR[IFS]: if (PSR.ic == 0) {...
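The BSP update for cover has the same shape as the BSP update for a BSPSTORE write in Table 6-4: add a register count, then add one extra slot for each NaT collection crossed. The sketch below works through that arithmetic on plain integers; `advance_backing_store_pointer` is a hypothetical name, and the code only models the formula as printed (integer division by 63 on the slot index in bits 8:3).

```python
def advance_backing_store_pointer(pointer, register_count):
    """Apply the tabulated update: p{63:3} + n + (p{8:3} + n) / 63.

    `pointer` is an 8-byte-aligned backing store address and
    `register_count` is CFM.sof (for cover) or RSE.ndirty (for a
    BSPSTORE write). The quotient accounts for NaT collection slots.
    """
    slot = pointer >> 3                 # bits 63:3
    slot_in_group = slot & 0x3F         # bits 8:3
    new_slot = slot + register_count + (slot_in_group + register_count) // 63
    return new_slot << 3                # bits 2:0 always read as zero
```

Starting from 0x1000 (slot index 0), advancing by 63 registers gives 0x1200 rather than 0x11F8, because the slot at 0x11F8 has bits 8:3 all ones and is reserved for a NaT collection.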
  • Page 392 • The CFM (after the return) is forced to zero; i.e., all CFM fields (including CFM.sof and CFM.sol) are set to zero. • The registers from the returned-from frame and the preserved registers from the returned-to frame are added to the invalid partition of the register stack. •...
  • Page 393 frame of the target instruction. When RSE.CFLE is set, instruction execution is stalled until the RSE has completely restored the current frame or an interruption occurs. This is the only time that the RSE issues any memory traffic for the current frame. Interruption delivery clears RSE.CFLE which allows an interruption handler to execute in the presence of an incomplete frame (e.g., to handle the fault raised by the mandatory RSE load).
  • Page 394 RSE Behavior on Interruptions When the processor raises an interruption, the current register stack frame remains unchanged. If PSR.ic is one, the valid bit in the Interruption Function State register (IFS.v) is cleared. When the IFS.v bit is clear, the contents of the interruption frame marker field (IFS.ifm) are undefined.
  • Page 395 current frame again (either via another alloc instruction, or via a br.ret or rfi to a previous frame that contained that register), the value stored in the register, the NaT bit for the register, and the corresponding ALAT entry for the register remain undefined. RSE stores do not invalidate ALAT entries.
  • Page 396 3. Non-preemptive, synchronous backing store switch (covers system calls, user-level thread and operating system context switches) Failure to follow these sequences may result in undefined RSE and processor behavior. 6.11.1 Switch from Interrupted Context To switch from the backing store of an interrupted context to a new backing store: 1.
  • Page 397 1. Read and save the RSC, BSP and PFS application registers. 2. Issue a flushrs instruction to flush the dirty registers to the backing store. 3. Place RSE in enforced lazy mode by clearing both RSC.mode bits. 4. Read and save the RNAT application register. 5.
  • Page 398 2:150 Volume 2, Part 1: Register Stack Engine...
  • Page 399 Debugging and Performance Monitoring Processors based on the Itanium architecture provide comprehensive debugging and performance monitoring facilities for both IA-32 and Itanium instructions. This chapter describes the debug registers, performance monitoring registers and their programming models. The debugging facilities include several data and instruction breakpoint registers, single step trap, breakpoint instruction fault, taken branch trap, lower privilege transfer trap, and instruction and data debug faults.
  • Page 400 reference that matches the parameters specified by the IBR registers results in an IA_32_Exception(Debug) fault. If PSR.id is 1 or EFLAG.rf is 1, IA-32 Instruction Debug faults are disabled for one instruction. The successful execution of an IA-32 instruction clears the PSR.id and EFLAG.rf bits. •...
  • Page 401 Instruction/Data TLB Miss fault. If DBR.r and DBR.w are both 0, that data breakpoint register is disabled. Execute match enable – When IBR.x is 1, execution of an IA-32 instruction or Intel Itanium instruction in a bundle at an address matching the corresponding address register causes a breakpoint.
  • Page 402 Changes to debug registers and PSR are not necessarily observed by following instructions. Software should issue a data serialization operation to ensure modifications to DBR, PSR.db, PSR.tb and PSR.lp are observed before a dependent instruction is executed. For register changes to IBR and PSR.db that affect fetching of subsequent instructions, software must issue an instruction serialization operation.
  • Page 403 • The cmp8xchg16 operands are treated as 16-byte datums for both read and write breakpoint matching, even though this instruction only reads 8 bytes. Address breakpoint Data Debug faults are not reported for the Flush Cache (fc, fc.i), regular_form probe, non-faulting lfetch, insert TLB (itc, itr), purge TLB (ptc, ptr), or translation access (thash, ttag, tak, tpa) instructions.
  • Page 404 Processor implementations may not populate the entire PMC/PMD register space. Reading of an unimplemented PMC or PMD register returns zero. Writes to unimplemented PMC or PMD registers are ignored; i.e., the written value is discarded. Writes to PMD and PMC and reads from PMC are privileged operations. At non-zero privilege levels, these operations result in a Privileged Operation fault, regardless of the register address.
  • Page 405 A counter overflow interrupt occurs when the counter wraps; i.e., a carry out from bit W-1 is detected. Counter overflow interrupts are edge-triggered; that is, the event of a counter incrementing and causing carry out from bit W-1 thus setting the overflow bit and the freeze bit, generates one PMU interrupt.
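The wrap condition can be made concrete with a toy model. This is a simplified sketch, not a hardware-accurate model: it assumes a counter of width W that always sets both the overflow and freeze bits on a carry out of bit W-1 and stops counting while frozen. The class and method names are hypothetical.

```python
class GenericCounter:
    """Toy model of a W-bit generic performance counter."""

    def __init__(self, width, value=0):
        self.width = width
        self.mask = (1 << width) - 1
        self.value = value & self.mask
        self.overflow = False  # set on carry out of bit W-1
        self.freeze = False    # set together with the overflow bit

    def increment(self, amount=1):
        """Count `amount` events; return True if the counter wrapped."""
        if self.freeze:
            return False  # assumption: a frozen counter does not count
        total = self.value + amount
        carried = (total >> self.width) != 0  # carry out of bit W-1
        self.value = total & self.mask
        if carried:
            self.overflow = True
            self.freeze = True
        return carried

    def software_clear(self):
        """Model software resetting the overflow and freeze bits."""
        self.overflow = False
        self.freeze = False
```

In this model a wrap is reported exactly once per overflow event, which mirrors the edge-triggered behavior described in the text.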
  • Page 406 Table 7-4. Generic Performance Counter Configuration Register Fields (PMC[4]..PMC[p]) (Continued) Field Bits Description Privileged monitor – When 0, the performance monitor is configured as a user monitor, and enabled by PSR.up. When PMC.pm is 1, the performance monitor is configured as a privileged monitor, enabled by PSR.pp, and the corresponding PMD can only be read by privileged software.
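The selection between PSR.up and PSR.pp by the pm field can be sketched as a single predicate. This is a partial illustration only: it models just the pm/up/pp relationship stated in the field description and deliberately ignores any other enabling conditions. The function name is hypothetical.

```python
def monitor_enabled_by_psr(pmc_pm, psr_up, psr_pp):
    """Return True if the PSR bit that governs this monitor is set.

    pmc_pm == 0: user monitor, enabled by PSR.up.
    pmc_pm == 1: privileged monitor, enabled by PSR.pp.
    Other enabling conditions are intentionally not modeled here.
    """
    return bool(psr_pp) if pmc_pm else bool(psr_up)
```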
  • Page 407 Table 7-5. Reading Performance Monitor Data Registers (Continued) PSR.sp PMC[i].pm PSR.cpl PMD Reads Return >0 >0 >0 Generic PMD counter registers may be read by software without stopping the counters. Under normal counting conditions (PMC[0].fr is zero and has been serialized), the processor guarantees that a sequence of reads of a given PMD will return non-decreasing values corresponding to the program order of the reads.
  • Page 408 7.2.2 Performance Monitor Overflow Status Registers (PMC[0]..PMC[3]) Performance monitor interrupts may be caused by an overflow from a generic performance monitor or an implementation-dependent event from a model-specific monitor. The four performance monitor overflow registers (PMC[0]...PMC[3]) shown in Figure 7-6 indicate which monitor caused the interruption.
  • Page 409 If control register bit PMV.m is one, a performance monitoring interrupt is disabled from being pended. When PMV.m is zero, the interruption is received and held pending. (Further masking by the PSR.i, TPR and in-service masking can keep the interrupt from being raised.) Figure 7-6 shows the Performance Monitor Overflow Status registers.
  • Page 410 Multiple overflow bits may be set to 1, if counters overflow concurrently. The overflow bits and the freeze bit are sticky; i.e., the processor sets them to 1 but never resets them to 0. It is software's responsibility to reset the overflow and freeze bits. The overflow status bits are populated only for implemented counters.
  • Page 411 follow the implementation-independent overflow interrupt service routine outlined in Figure 7-7. Use of alternate context-switch sequences may be incompatible with future implementations. If the outgoing context has an interrupt pending but has not yet invoked the performance monitor interrupt service routine, the interrupt may be delivered to the incoming context even if it is a non-monitored process.
  • Page 412 When switching back to the original context (that originally caused the counter overflow), the previously saved freeze bit can be inspected. If it was set (meaning there was a pending performance monitor interrupt), then the context switch routine posts an interrupt message to the incoming context’s processor at the performance monitor vector specified by the PMV register (see Section 10.5.8, “Inter-processor Interrupts Layout and Example”...
  • Page 413 Interruption Vector Descriptions Chapter 5 describes the interruption mechanism and programming model for the Itanium architecture. This chapter describes the IVA-based interruption handlers. “Interruption Vector Descriptions” describes all the Itanium IVA-based interruption vectors and “IA-32 Interruption Vector Definitions” describes all of the IA-32 interrupt vectors.
  • Page 414 Interruption Vector Definition Table 8-1.Writing of Interruption Resources by Vector IIP, IPSR, Interruption Resource ITIR IIB0, IIB1 IIPA, IFS.v PSR.ic at time of interruption Alternate Data TLB vector Alternate Data TLB fault IR Alternate Data TLB fault Alternate Instruction TLB vector Alternate Instruction TLB fault Break Instruction vector Break Instruction fault...
  • Page 415 Table 8-1.Writing of Interruption Resources by Vector (Continued) IIP, IPSR, Interruption Resource ITIR IIB0, IIB1 IIPA, IFS.v PSR.ic at time of interruption Reserved Register/Field fault Unimplemented Data Address fault IA-32 Exception vector IA-32 Intercept vector IA-32 Interrupt vector Instruction Access Rights vector Instruction Access Rights fault Instruction Access-Bit vector...
  • Page 416 Table 8-1.Writing of Interruption Resources by Vector (Continued) IIP, IPSR, Interruption Resource ITIR IIB0, IIB1 IIPA, IFS.v PSR.ic at time of interruption Unaligned Data Reference fault Unsupported Data Reference vector Unsupported Data Reference fault VHPT Translation vector IR VHPT Data fault VHPT Data fault VHPT Instruction fault Virtual External Interrupt vector...
  • Page 417 Table 8-2. ISR Values on Interruption (Continued) Vector / Interruption Instruction Debug fault IR Data Debug fault Dirty-Bit vector Data Dirty Bit fault Disabled FP-Register vector Disabled Floating-Point Register fault External Interrupt vector External Interrupt Floating-point Fault vector Floating-Point Exception fault Floating-point Trap vector Floating-Point Exception trap General Exception vector...
  • Page 418 Software must look at the ISR.code bit vector to determine if any lower priority trap occurred at the same time as the trap being processed. Table 8-3. ISR.code Fields on Intel® Itanium® Traps Field Description Floating-Point Exception trap...
  • Page 419 Table 8-3. ISR.code Fields on Intel® Itanium® Traps (Continued) Field Description Taken Branch trap Single Step trap Unimplemented Instruction Address trap fp trap code IEEE O (overflow) exception (Parallel FP-LO) fp trap code IEEE U (underflow) exception (Parallel FP-LO)...
  • Page 420 Table 8-4. Interruption Vectors Sorted Alphabetically (Continued) Vector Name Offset Page Unsupported Data Reference 0x5b00 2:202 vector VHPT Translation vector 0x0000 2:173 Virtual External Interrupt vector 0x3400 2:187 Virtualization vector 0x6100 2:209 2:172 Volume 2, Part 1: Interruption Vector Descriptions...
  • Page 421 VHPT Translation vector (0x0000) Name Cause The hardware VHPT walker encountered a TLB miss while attempting to reference the virtually addressed hashed page table for a memory reference (including IA-32). Interruptions on this vector: IR VHPT Data fault VHPT Instruction fault VHPT Data fault Parameters IIP, IPSR, IIPA, IFS –...
  • Page 422 [ISR bit-layout figure, bits 63:0; named field: ni] Notes This fault can only occur when PSR.ic is 1 or in-flight, and the VHPT walker is enabled...
  • Page 423 Instruction TLB vector (0x0400) Name Cause The instruction TLB entry needed by an instruction fetch (including IA-32) is absent, and the hardware VHPT walker could not find the translation in the VHPT, or the hardware VHPT walker is enabled but not implemented on this processor. Interruptions on this vector: Instruction TLB fault Parameters...
  • Page 424 Data TLB vector (0x0800) Name Cause For memory references (including IA-32), the data TLB entry needed by the data access is absent, and the hardware VHPT walker could not find the translation in the VHPT, or the hardware VHPT walker is not implemented on this processor. Interruptions on this vector: IR Data TLB fault Data TLB fault...
  • Page 425 Alternate Instruction TLB vector (0x0c00) Name Cause The instruction TLB entry needed by an instruction fetch (including IA-32) is absent, and the hardware VHPT walker was not enabled for this address. Interruptions on this vector: Alternate Instruction TLB fault Parameters IIP, IPSR, IIPA, IFS –...
  • Page 426 Alternate Data TLB vector (0x1000) Name Cause For memory references (including IA-32), the data TLB entry needed by data access is absent, and the hardware VHPT walker was not enabled for this address. Interruptions on this vector: IR Alternate Data TLB fault Alternate Data TLB fault Parameters IIP, IPSR, IIPA, IFS –...
  • Page 427 Data Nested TLB vector (0x1400) Name Cause For memory references, the data TLB entry needed for a data reference is absent and PSR.ic is 0. Note: Data Nested TLB faults cannot occur during IA-32 instruction set execution, since PSR.ic must be 1. Interruptions on this vector: IR Data Nested TLB fault Data Nested TLB fault...
  • Page 428 Instruction Key Miss vector (0x1800) Name Cause For instruction fetches (including IA-32), the PSR.it bit is 1, the PSR.pk bit is 1, and the access key from the TLB entry for the address of the executing instruction bundle does not match any of the valid protection keys. Interruptions on this vector: Instruction Key Miss fault Parameters...
  • Page 429 Data Key Miss vector (0x1c00) Name Cause For memory references (including IA-32), the PSR.dt bit is 1, the PSR.pk bit is 1, and the access key from the TLB entry for the address referenced by a load, store, probe (regular_form probe or probe.fault) or semaphore operation does not match any of the valid protection keys.
  • Page 430 Dirty-Bit vector (0x2000) Name Cause IA-32 or Itanium store or semaphore operations to a page with the dirty-bit (TLB.d) equal to 0 in the data TLB. Interruptions on this vector: Data Dirty Bit fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description.
  • Page 431 Instruction Access-Bit vector (0x2400) Name Cause For instruction fetches (including IA-32), the access bit (TLB.a) in the TLB entry for this page is 0, and an instruction on the page is referenced. Interruptions on this vector: Instruction Access Bit fault Parameters IIP, IPSR, IIPA, IFS –...
  • Page 432 Data Access-Bit vector (0x2800) Name Cause For data memory references (including IA-32), the access bit (TLB.a) in the TLB entry for this page is 0, and the page is referenced. Interruptions on this vector: IR Data Access Bit fault Data Access Bit fault Parameters IIP, IPSR, IIPA, IFS –...
  • Page 433 Break Instruction vector (0x2c00) Name Cause An attempt is made to execute an Itanium break instruction. Interruptions on this vector: Break Instruction fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description. IIM – Is updated with the break instruction immediate value. IIB0, IIB1 –...
  • Page 434 External Interrupt vector (0x3000) Name Cause There are unmasked external interrupts pending from external devices, other processors, or internal processor events and: • PSR.i is 1, while executing Itanium instructions • PSR.i is 1 and (CFLAG.if is 0 or EFLAG.if is 1), while executing IA-32 instructions IPSR.is indicates which instruction set was executing at the time of the interruption.
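The two enabling conditions listed above can be written as one predicate. This sketch assumes that an unmasked external interrupt is already pending and only evaluates the PSR.i, CFLAG.if and EFLAG.if terms from the text; the function and parameter names are hypothetical.

```python
def external_interrupt_enabled(psr_i, executing_ia32=False, cflag_if=0, eflag_if=0):
    """Evaluate the flag conditions for taking a pending external interrupt.

    Itanium instructions: PSR.i must be 1.
    IA-32 instructions:   PSR.i must be 1 and (CFLAG.if is 0 or EFLAG.if is 1).
    """
    if not psr_i:
        return False
    if not executing_ia32:
        return True
    return cflag_if == 0 or eflag_if == 1
```

The only combination that blocks delivery while PSR.i is 1 is IA-32 execution with CFLAG.if set to 1 and EFLAG.if set to 0.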
  • Page 435 Virtual External Interrupt vector (0x3400) Name Cause The guest highest pending interrupt (GHPI) specified by the VMM is unmasked on the virtual processor. IPSR.is indicates which instruction set was executing at the time of the interruption. Interruptions on this vector: Virtual External Interrupt Parameters IIP, IPSR, IIPA, IFS –...
  • Page 436 Page Not Present vector (0x5000) Name Cause The bundle or IA-32 instruction being executed resides on a page for which the P-bit (TLB.p) in the instruction TLB entry is 0, or the data being referenced resides on a page for which the P-bit in the data TLB entry is 0. Interruptions on this vector: IR Data Page Not Present fault Instruction Page Not Present fault...
  • Page 437 Key Permission vector (0x5100) Name Cause Data access (including IA-32): The PSR.dt bit is 1, the PSR.pk bit is 1 and read or write permission is disabled by the matching protection register on a load, store, or semaphore operation. The RSE may cause this fault if PSR.rt is 1, the PSR.pk bit is 1 and read or write permission is disabled by the matching protection register on an RSE mandatory load/store operation.
  • Page 438 Instruction Access Rights vector (0x5200) Name Cause For instruction fetches (including IA-32), the PSR.it bit is 1, and the access rights for this page do not allow execution or do not allow execution at the current privilege level. Interruptions on this vector: Instruction Access Rights fault Parameters IIP, IPSR, IIPA, IFS –...
  • Page 439 Data Access Rights vector (0x5300) Name Cause For memory references (including IA-32), the PSR.dt bit is 1, and the access rights for this page do not allow read access or do not allow read access at the current privilege level for load and semaphore operations. The PSR.dt bit is 1, and the access rights for this page do not allow write access or do not allow write access at the current privilege level for store and semaphore operations.
  • Page 440 General Exception vector (0x5400) Name Cause An attempt is being made to execute an illegal operation, privileged instruction, access a privileged register, unimplemented field, unimplemented register, unimplemented address, or take an inter-instruction set branch when disabled. Interruptions on this vector: IR Unimplemented Data Address fault Illegal Operation fault Illegal Dependency fault...
  • Page 441 • If the instruction has two PR targets, and specifies the same PR for both, predicated-off unconditional compare, fclass, tbit, tnat, and tf instructions take this fault, even when their qualifying predicate is zero. • Register bank conflict on a floating-point load pair instruction. •...
  • Page 442 • ISR.code{7:4} = 4: Disabled Instruction Set Transition fault. An instruction set transition was attempted while PSR.di was 1. This fault can be raised by either the Itanium br.ia instruction or the IA-32 jmpe instruction. IPSR.is indicates the faulting instruction set. •...
  • Page 443 Disabled FP-Register vector (0x5500) Name Cause An attempt is made to reference a floating-point register set that is disabled. When PSR.dfl is 1, execution of any IA-32 FP, SSE or MMX technology instructions raises a Disabled FP Register Low Fault (regardless of whether FR2 - FR31 are actually referenced).
  • Page 444 NaT Consumption vector (0x5600) Name Cause A non-speculative operation (including IA-32) (e.g., load, store, control register access, instruction fetch etc.) read a NaT source register, NaTVal source register, or referenced a NaTPage. Interruptions on this vector: IR Data NaT Page Consumption fault Instruction NaT Page Consumption fault Register NaT Consumption fault Data NaT Page Consumption fault...
  • Page 445 behavior of NaT and NaTVal values is model specific, see Section 6.2.4.3, “NaT/NaTVal Response for IA-32 Instructions” on page 1:134 for details. • ISR – The value for the ISR bits depend on the type of access performed and are specified below.
  • Page 446 Speculation vector (0x5700) Name Cause A chk.a, chk.s, or fchkf instruction needs to branch to recovery code, and the branching behavior is unimplemented by the processor. This fault cannot be raised by IA-32 instructions. Interruptions on this vector: Speculative Operation fault Parameters IIP, IPSR, IIPA, IFS –...
  • Page 447 The Speculative Operation fault handler does not need to check for unimplemented instruction addresses. They will be checked automatically by processor hardware when the handler executes its rfi. On processors which report unimplemented instruction addresses with an Unimplemented Instruction Address (UIA) trap, if an emulated check instruction targets an unimplemented address and also needs to take a Single Step trap or Taken Branch trap (or both), the UIA trap will not be raised until after the Single Step and/or Taken Branch trap has been handled, making it appear that the Unimplemented...
  • Page 448 Debug vector (0x5900) Name Cause A debug fault has occurred. Either the instruction address matches the parameters set up in the instruction debug registers, or the data address of a load, store, semaphore, or mandatory RSE fill or spill matches the parameters set up in the data debug registers.
  • Page 449 Unaligned Reference vector (0x5a00) Name Cause If PSR.ac is 1, and the data address being referenced by an Itanium instruction is not aligned to the natural size of the load, store, or semaphore operation, or a data reference is made to a misaligned datum not supported by the implementation. “Memory Access Instructions”...
  • Page 450 Unsupported Data Reference vector (0x5b00) Name Cause An attempt was made to: • Execute a fetchadd, cmpxchg, xchg, or unsupported ld16, st16 or 10-byte memory reference (ldfe or stfe) instruction to a page that is neither cacheable with write-back write policy nor a NaTPage. •...
  • Page 451 Floating-point Fault vector (0x5c00) Name Cause A floating-point exception fault has occurred. IA-32 numeric instructions cannot raise this fault; IA-32 floating-point faults are delivered on the IA_32_Exception(Floating-Point) vector. Interruptions on this vector: Floating-Point Exception fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description.
  • Page 452 Floating-point Trap vector (0x5d00) Name Cause A floating-point exception trap has occurred. IA-32 numeric instructions cannot raise this trap. Interruptions on this vector: Floating-Point Exception trap Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description. IIB0, IIB1 –...
  • Page 453 Lower-Privilege Transfer Trap vector (0x5e00) Name Cause Two trapping conditions transfer control to this vector: • An attempt is made to transfer control to an unimplemented address, resulting in either an Unimplemented Instruction Address trap or an Unimplemented Instruction Address fault. See “Unimplemented Address Bits”...
  • Page 454 [ISR bit-layout figure, bits 63:0; named fields: ss, tb, ni, ir] Notes The Unimplemented Instruction Address trap can be the result of a taken branch, a...
  • Page 455 Taken Branch Trap vector (0x5f00) Name Cause A taken branch was executed, and the PSR.tb bit is 1. IA-32 instructions cannot raise this trap; IA-32 taken branch traps are delivered on the IA_32_Exception(Debug) vector. The Taken Branch trap is not taken on an rfi instruction. Interruptions on this vector: Taken Branch trap Parameters...
  • Page 456 Single Step Trap vector (0x6000) Name Cause An instruction was successfully executed, and the PSR.ss bit is 1. For the IA-32 instruction set, this condition is delivered on the IA_32_Exception(Debug) vector; see Chapter 9, “IA-32 Interruption Vector Descriptions.” IA-32 instructions cannot raise this trap; IA-32 single step events are delivered on the IA_32_Exception(Debug) vector.
  • Page 457 Virtualization vector (0x6100) Name Cause An attempt is made to execute an instruction which requires virtualization. This fault cannot be raised by IA-32 instructions. Interruptions on this vector: Virtualization fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description.
  • Page 458 IA-32 Exception vector (0x6900) Name Cause A fault or trap was raised while executing from the IA-32 instruction set. Interruptions on this vector: IA-32 Instruction Debug fault IA-32 Code Fetch fault IA-32 Instruction Length > 15 bytes fault IA-32 Device Not Available fault IA-32 FP Error fault IA-32 Segment Not Present fault IA-32 Stack Exception fault...
  • Page 459 IA-32 Intercept vector (0x6a00) Name Cause An intercept fault or trap was raised while executing from the IA-32 instruction set. This vector handles all the IA-32 intercepts described in Chapter 9, “IA-32 Interruption Vector Descriptions.” Interruptions on this vector: IA-32 Invalid Opcode fault IA-32 Instruction Intercept fault IA-32 Locked Data Reference fault IA-32 System Flag Intercept trap...
  • Page 460 IA-32 Interrupt vector (0x6b00) Name Cause An IA-32 software interrupt trap was executed. This vector handles all the IA-32 software interrupts described in Chapter 9, “IA-32 Interruption Vector Descriptions.” Interruptions on this vector: IA-32 Software Interrupt (INT) trap Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description.
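The vector offsets quoted in the descriptions above can be collected into a lookup table. This is a sketch for illustration: the offsets are the ones that appear in the vector headings, while the IVA base used in the usage note and the names `VECTOR_OFFSETS`, `handler_address` and `vector_name` are hypothetical.

```python
# Offsets as given in the vector headings (for example "Debug vector (0x5900)").
VECTOR_OFFSETS = {
    "VHPT Translation": 0x0000, "Instruction TLB": 0x0400,
    "Data TLB": 0x0800, "Alternate Instruction TLB": 0x0C00,
    "Alternate Data TLB": 0x1000, "Data Nested TLB": 0x1400,
    "Instruction Key Miss": 0x1800, "Data Key Miss": 0x1C00,
    "Dirty-Bit": 0x2000, "Instruction Access-Bit": 0x2400,
    "Data Access-Bit": 0x2800, "Break Instruction": 0x2C00,
    "External Interrupt": 0x3000, "Virtual External Interrupt": 0x3400,
    "Page Not Present": 0x5000, "Key Permission": 0x5100,
    "Instruction Access Rights": 0x5200, "Data Access Rights": 0x5300,
    "General Exception": 0x5400, "Disabled FP-Register": 0x5500,
    "NaT Consumption": 0x5600, "Speculation": 0x5700,
    "Debug": 0x5900, "Unaligned Reference": 0x5A00,
    "Unsupported Data Reference": 0x5B00, "Floating-point Fault": 0x5C00,
    "Floating-point Trap": 0x5D00, "Lower-Privilege Transfer Trap": 0x5E00,
    "Taken Branch Trap": 0x5F00, "Single Step Trap": 0x6000,
    "Virtualization": 0x6100, "IA-32 Exception": 0x6900,
    "IA-32 Intercept": 0x6A00, "IA-32 Interrupt": 0x6B00,
}


def handler_address(iva_base, name):
    """Return the handler entry address for a named vector (base + offset)."""
    return iva_base + VECTOR_OFFSETS[name]


def vector_name(offset):
    """Return the vector name for an offset, or None if it is not listed."""
    for name, value in VECTOR_OFFSETS.items():
        if value == offset:
            return name
    return None
```

With a hypothetical base of 0x10000, `handler_address(0x10000, "External Interrupt")` evaluates to 0x13000. In the listed offsets, vectors up to 0x3400 are spaced 0x400 apart and vectors from 0x5000 onward are spaced 0x100 apart.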
  • Page 461 EFLAG.tf is 1. b0 to b3 Data breakpoint trap due to a match with the corresponding Intel Itanium data breakpoint registers. Each bit indicates a match with the corresponding DBR registers; b0=DBR0/1, b1=DBR2/3, b2=DBR4/5, b3=DBR6/7. Zero, one or more bits may be set.
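The b0 to b3 mapping can be decoded as shown below. This sketch assumes the four bits have already been extracted into a small integer with b0 as the least significant bit; their position within the reported code is not given in this excerpt, and the function name is hypothetical.

```python
def matched_dbr_pairs(b_bits):
    """Map the b0..b3 match bits to data breakpoint register pairs.

    b0 -> DBR0/1, b1 -> DBR2/3, b2 -> DBR4/5, b3 -> DBR6/7.
    Zero, one or more bits may be set, so a list is returned.
    """
    return [(2 * i, 2 * i + 1) for i in range(4) if (b_bits >> i) & 1]
```

For example, a value of 0b0101 reports matches on DBR0/1 and DBR4/5.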
  • Page 462 IA_32_Exception (Divide) – Divide Fault Name ® Cause IA-32 IDIV or DIV instruction attempted a divide by zero operation. Refer to the Intel 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault. Parameters IIP – virtual IA-32 instruction address zero extended to 64-bits.
  • Page 463 The Itanium architecture debug facilities triggered an IA-32 code breakpoint fault on a ® IA-32 instruction fetch and PSR.id and EFLAG.rf are 0. Refer to the Intel 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault.
  • Page 464 In the Itanium System Environment, IA-32 Mov SS or Pop SS single step and data breakpoint traps are NOT deferred to the next instruction. Refer ® to the Intel 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this trap.
  • Page 465 IA_32_Exception (Break) – INT 3 Trap Name ® Cause IA-32 breakpoint instruction (INT 3) triggered a trap. Refer to the Intel 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this trap. Parameters IIPA – trapping virtual IA-32 instruction address zero extended to 64-bits.
  • Page 466 IA_32_Exception (Overflow) – Overflow Trap Name ® Cause IA-32 INTO instruction execution when EFLAG.of is set to one. Refer to the Intel and IA-32 Architectures Software Developer’s Manual for a complete definition of this trap. Parameters IIPA – trapping virtual IA-32 instruction address zero extended to 64-bits.
  • Page 467 IA_32_Exception (Bound) – Bounds Fault Name Cause Failed IA-32 Bound check instruction. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault. Parameters IIP – virtual IA-32 instruction address zero extended to 64-bits.
  • Page 468 IA_32_Exception (InvalidOpcode) – Invalid Opcode Fault Name Cause All IA-32 invalid opcode faults are delivered to the IA_32_Intercept(Instruction) handler, including IA-32 illegal, unimplemented opcodes, MMX technology and SSE instructions if CR0.EM is 1, and SSE instructions if CR4.fxsr is 0. All illegal IA-32 floating-point opcodes result in an IA_32_Intercept(Instruction) regardless of the state of CR0.em.
  • Page 469 The processor executed an IA-32 ESC or floating-point instruction while CR0.em is 1, or an IA-32 WAIT, ESC, floating-point, MMX technology or SSE instruction while the CR0.ts bit is 1. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault. Parameters IIP –...
  • Page 470 Double Fault Name Cause IA-32 Double Faults (IA-32 vector 8) are not generated by the processor in the Itanium System Environment. 2:222 Volume 2, Part 1: IA-32 Interruption Vector Descriptions...
  • Page 471 Invalid TSS Fault Name Cause IA-32 Invalid TSS Faults (IA-32 vector 10) are not generated in the Itanium System Environment.
  • Page 472 IIPA – virtual address of the faulting IA-32 instruction zero extended to 64-bits. ISR.vector – 11. ® ISR.code – IA-32 defined error code. See Intel 64 and IA-32 Architectures Software Developer’s Manual. 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9...
  • Page 473 IA-32 defined set of stack segment fault conditions detected during stack segment load operations or memory references relative to the stack segment; refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete list of all IA-32 faulting conditions. Stack faults can also be generated when the processor detects an inconsistent stack segment register descriptor value during an IA-32 stack reference instruction (e.g.
  • Page 474 IA-32 defined set of data and code segment fault conditions detected during data or code segment load operations or memory references relative to code or data segments; refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete list of all IA-32 General Protection Fault conditions. General Protection faults...
  • Page 475 Page Fault Name Cause IA-32 defined page faults (IA-32 vector 14) can not be generated in the Itanium System Environment.
  • Page 476 Itanium System Environment. IA-32 numeric exception delivery is not triggered by Itanium numeric exceptions or the execution of Itanium numeric instructions. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault.
  • Page 477 An IA-32 instruction performed an unaligned data memory reference while PSR.ac is 1, or EFLAG.ac is 1 and CR0.am is 1 and the effective privilege level is 3. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault.
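The enabling condition for this fault can be written as a single predicate. The sketch below is illustrative: the function and parameter names are hypothetical, and it assumes the sentence groups as "PSR.ac, or (EFLAG.ac and CR0.am and privilege level 3)".

```python
def alignment_check_enabled(psr_ac, eflag_ac, cr0_am, privilege_level):
    """True when an unaligned IA-32 data reference would be checked.

    Assumed grouping: PSR.ac == 1, OR (EFLAG.ac == 1 AND CR0.am == 1
    AND the effective privilege level is 3).
    """
    ia32_style_check = bool(eflag_ac) and bool(cr0_am) and privilege_level == 3
    return bool(psr_ac) or ia32_style_check
```

Under that reading, PSR.ac alone is sufficient at any privilege level, while the IA-32 style check applies only at privilege level 3.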
  • Page 478 Machine Check Name Cause IA-32 Machine Check (IA-32 vector 18) is not generated in the Itanium System Environment.
  • Page 479 SSE instruction. SSE instructions do NOT trigger the report of any pending IA-32 floating-point exceptions. SSE instructions always ignore CR0.ne and the IGNNE pin. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault.
  • Page 480 IA_32_Interrupt (Vector #N) – Software Trap Name Cause The IA-32 INT n instruction forces an IA-32 interrupt trap. The IA-32 IDT is not consulted nor are any values pushed onto a memory stack. Parameters IIPA – trapping virtual IA-32 instruction address (points to the INT instruction) zero extended to 64-bits.
  • Page 481 INT1, SIDT, SGDT, SLDT, SMSW, WBINVD, WRMSR, and all other unimplemented and illegal opcode patterns. If CR0.em is 1, execution of all IA-32 Intel MMX technology and IA-32 SSE instructions results in this intercept. If CR4.FXSR is 0, execution of all IA-32 SSE instructions results in this intercept.
  • Page 482 Figure 9-3. IA-32 Intercept Code 15 14 13 12 11 10 9 sp np rp lp as os 0 Table 9-1. Intercept Code Definition Field Bits Description Operand Size – (OperandSize Prefix XOR CSD.d bit). When 1, indicates the effective operand size is 32-bits, when 0, 16-bits. Address Size –...
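The Operand Size field definition in Table 9-1 amounts to a one-bit XOR. A minimal sketch, with hypothetical function and parameter names:

```python
def effective_operand_size(operand_size_prefix, csd_d):
    """Return the effective operand size in bits (16 or 32).

    The intercept code's Operand Size bit is
    (OperandSize Prefix XOR CSD.d); 1 means 32-bit, 0 means 16-bit.
    """
    os_bit = (operand_size_prefix & 1) ^ (csd_d & 1)
    return 32 if os_bit else 16
```

For example, a 32-bit code segment (CSD.d = 1) with no prefix yields 32, and adding the operand size prefix flips it to 16.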
  • Page 483 IA_32_Intercept (Gate) – Gate Intercept Trap Name Cause If an IA-32 control transfer is initiated through a GDT/LDT descriptor that transfers control through a Call Gate, Task Gate or Task Segment this interception trap is generated. Parameters IIPA – trapping virtual IA-32 instruction address zero extended to 64-bits. IIP –...
  • Page 484 IA_32_Intercept (SystemFlag) – System Flag Trap Name Parameters System Flag Intercept Traps are generated for the following conditions: CLI, STI, POPF, POPFD instructions. If the EFLAG.if bit changes state and CFLG.ii is 1, or EFLAG.tf or EFLAG.ac change state, a System Flag intercept notification trap is delivered after the instruction completes.
  • Page 485 IA_32_Intercept (Lock) – Locked Data Reference Fault Name Cause For IA-32 locked operations, if the DCR.lc bit is 1, and an atomic operation is made to non-write-back memory or to unaligned write-back memory that would result in a read-modify-write sequence being performed externally under an external bus lock, the processor raises a Locked Data Reference fault.
  • Page 487 ® Itanium Architecture-based Operating System Interaction Model with IA-32 Applications This section describes the IA-32 system execution model from the perspective of an Itanium architecture-based operating system interfacing with IA-32 code, while operating in the Itanium System Environment. The main features covered are: •...
  • Page 488 Control Registers unmodified, Controls instruction set execution (including IA-32) shared IFA, IIP, Intel Itanium interruption registers may be overwritten on IPSR, ISR, any TLB fault, interruption or exception encountered IIM, IIPA, during IA-32 or Intel Itanium instruction set execution. shared...
  • Page 489 When Itanium architecture-based software loads these registers, no data integrity checks are performed at that time if illegal values are loaded in any fields. For a complete definition of all bit fields and field semantics refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
  • Page 490 The TSSD descriptor points to the I/O Permission Bitmap. If CFLG.io is 1, IN, INS, OUT, and OUTS consult the TSSD I/O permission bitmap as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual. If CFLG.io is 0, the TSSD I/O permission bitmap is not checked.
  • Page 491 10.3.1 IA-32 Current Privilege Level PSR.cpl is the current privilege level of the processor for instruction execution (including IA-32). PSR.cpl is used by the processor for all IA-32 descriptor segmentation and paging permission checks. PSR.cpl is a secured register. Typical IA-32 processors used SSD.dpl as the official privilege level of the processor.
  • Page 492 If CFLG.ii is 1, successful modification of the IF-bit by CLI, STI, or POPF results in an IA_32_Intercept(SystemFlag) trap, otherwise the IF-bit is modified without interception. Modification of this bit by Intel Itanium instructions does not result in an ®...
  • Page 493 13:12 IA-32 In/Out Privilege Level, controls accessibility by IA-32 IN/OUT instructions to the I/O port space and permission to modify the IF-bit for Intel Itanium and IA-32 instructions. If PSR.cpl > IOPL, permission is denied for IA-32 IN/OUT instructions, and modifications of EFLAG.if by either IA-32 or Intel Itanium instructions are ignored.
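The IOPL comparison described above can be sketched as follows. The helper names are hypothetical, and the sketch models only the IOPL test on EFLAG bits 13:12; any later check, such as the TSS I/O permission bitmap when CFLG.io is 1, is outside its scope.

```python
def iopl(eflag):
    """Extract the 2-bit I/O privilege level from EFLAG bits 13:12."""
    return (eflag >> 12) & 0x3


def passes_iopl_check(psr_cpl, eflag):
    """True when PSR.cpl <= IOPL, i.e. the IOPL test does not deny access."""
    return psr_cpl <= iopl(eflag)
```

With IOPL set to 3 (EFLAG bits 13:12 both set), code at any privilege level passes; with IOPL 0, only privilege level 0 passes.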
  • Page 494 64 and IA-32 Architectures Software Developer’s Manual for details. Affects execution of POPF, PUSHF, CLI and STI. This bit is supported in both the IA-32 and Intel Itanium System Environments. An IA-32 Code Fetch fault (GPFault(0)) is generated on every IA-32 instruction (including the target of rfi and br.ia), if the following condition is true:...
  • Page 495 CFLG.mp is 1, execution of IA-32 FWAIT/WAIT instructions results in an IA_32_Exception(DNA) fault. This bit is ignored by Intel Itanium instructions. This bit is supported in both the IA-32 and Intel ® Itanium System Environments. See Intel 64 and IA-32 Architectures Software Developer’s...
  • Page 496 CR0.NE CFLG.ne Numeric Error: Numeric errors are always enabled in the Intel Itanium System Environment. The NE bit and the IGNNE# pin are ignored by the processor and the FERR# pin is not asserted for any numeric errors on IA-32 or Intel Itanium floating-point instructions.
  • Page 497 Itanium architecture-based code does NOT have any side effects such as flushing the TLBs. This bit is supported as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual for the IA-32 System Environment.
  • Page 498 IA-32 Architectures Software Developer’s Manual for the IA-32 System Environment. CR4.PGE CFLG.pge Paging Global Enable: This bit is ignored in the Intel Itanium System Environment. This bit is provided as storage for compatibility purposes. This bit is ® supported as defined in the Intel 64 and IA-32 Architectures Software Developer’s Manual for...
  • Page 499 CR4.pce is 1. Otherwise execution of the RDPMC instruction results in a GPFault. CFLG.pce is ignored by Intel Itanium instructions. This bit is supported in both the IA-32 and Intel Itanium System Environments. See the Intel® 64 and IA-32 Architectures Software Developer’s Manual for details on these bits.
  • Page 500 10.3.3.3 IA-32 Memory Type Range Registers (MTRRs) Within the Itanium System Environment, IA-32 MTRR registers are superseded by physical memory attributes supplied by the TLB, as defined in Section 4.4.3, “Cacheability and Coherency Attribute” on page 2:77. IA-32 instruction references to the MTRRs in the MSR register space results in an instruction intercept fault.
  • Page 501 Table 10-5 summarizes IA-32 instruction behavior within the Itanium System ® Environment. All IA-32 instructions are unchanged from the Intel 64 and IA-32 Architectures Software Developer’s Manual except where noted. IA-32 instructions can also generate additional Itanium register and memory faults as defined in ®...
  • Page 502 ® Table 5-6. Please refer to the Intel 64 and IA-32 Architectures Software Developer’s Manual for the behavior of all IA-32 instructions in the IA-32 System Environment. For all listed and unlisted IA-32 instructions in Table 10-5 the following relationships hold: •...
  • Page 503 Table 10-5. IA-32 Instruction Summary (Continued) ® ® Intel Itanium System IA-32 Instruction Comments Environment CMPXCHG, 8B Optional Lock Intercept If Locks are disabled (DCR.lc is 1) and a processor external lock transaction is required CPUID CWD, CDQ CVTPI2PS, CVTPS2PI,...
  • Page 504 Table 10-5. IA-32 Instruction Summary (Continued) ® ® Intel Itanium System IA-32 Instruction Comments Environment F2XM1 FABS FADD, FADDP, FIADD FBLD FBSTP FCHS FCLEX, FNCLEX FCMOV FCOM, FCOMPP FCOMI, FCOMIP FUCOMI, FUCOMIP FCOS FDECSTP FDIV, FDIVP, FIDIV FDIVR, FDIVRP, FDIVR...
  • Page 505 IMUL IN, INS unchanged + I/O ports are If CFLG.io is 0, the TSS I/O permission bitmap is mapped virtually not consulted. Intel Itanium TLB faults control accessibility to I/O ports. unchanged INT 3, INTO Mandatory Exception vector Delivered as an IA_32_Interrupt...
  • Page 506 ORPS OUT, OUTS unchanged + I/O ports are If CFLG.io is 0, the TSS I/O permission bitmap is mapped virtually not consulted. Intel Itanium TLB faults control accessibility to I/O ports. PACKSS, PACKUS PADD, PADDS, PADDUS PAND, PANDN PCMPEQ, PCMPGT...
  • Page 507 Table 10-5. IA-32 Instruction Summary (Continued) ® ® Intel Itanium System IA-32 Instruction Comments Environment near: no change far: no change less privilege: no change same privilege: no change + additional taken branch trap If PSR.tb is 1, raise a taken branch trap.
  • Page 508 Zero Index tation Extend Displacement ® ® Intel Itanium Base 10.6.1 Virtual Memory References In the Itanium System Environment the following virtual memory options are available for supporting IA-32 and Itanium memory references. • Software TLB fills (TLBs are enabled, but the VHPT is disabled).
  • Page 509 10.6.2 IA-32 Virtual Memory References By definition, IA-32 instruction and data memory references are confined to 32-bits of virtual addressing, the first 4 G-bytes of virtual region 0. However, IA-32 memory references can be mapped anywhere within the implemented physical address space by operating system code.
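To make the "first 4 G-bytes of virtual region 0" statement concrete, the sketch below zero-extends a 32-bit IA-32 address and reports its virtual region. The function names are hypothetical, and the sketch assumes the virtual region number is held in bits 63:61 of the 64-bit virtual address, which is not stated in this excerpt.

```python
def zero_extend_ia32_address(addr32):
    """Return a 32-bit IA-32 virtual address as a 64-bit value (upper bits 0)."""
    if not 0 <= addr32 <= 0xFFFF_FFFF:
        raise ValueError("IA-32 virtual addresses are confined to 32 bits")
    return addr32  # bits 63:32 are zero by construction


def virtual_region(address64):
    """Virtual region number; assumed here to be bits 63:61 of the address."""
    return (address64 >> 61) & 0x7
```

Under that assumption, every zero-extended IA-32 address, including the highest one, falls in region 0.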
  • Page 510 Figure 10-5. Physical Memory Addressing 64-bit 16/32-bit Physical Address Effective Address PA{63:32}=0 Base PA{31:0} IA-32 Segmen- Index tation Displacement PA{63:0} ® ® Intel Itanium Base ® 2:262 Volume 2, Part 1: Itanium Architecture-based Operating System Interaction Model with IA-32 Applications...
  • Page 511 10.6.6 Supervisor Accesses If the processor is operating in the Itanium System Environment, supervisor override is disabled, and LDT, GDT, TSS references are performed at the privilege level specified by PSR.cpl. Unaligned processor references to LDT, GDT, and TSS segments will never generate an EFLAG.ac enabled IA-32 Exception (AlignmentCheck) fault, even if PSR.cpl equals 3 and supervisor override is disabled.
  • Page 512 10.6.8 Atomic Operations All Itanium load/stores and IA-32 non-locked memory references up to 64-bits that are aligned to their natural data boundaries are atomic. Both IA-32 and Itanium atomic semaphore operations can be performed on the same shared memory location. The processor ensures IA-32 locked read-modify-write opcodes and Itanium semaphore operations are performed atomically even if the operations are initiated from the other instruction set by the same processor, or between multiple processors in a multiprocessing system.
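The atomicity rule for non-locked references can be expressed as a simple alignment test. This is a sketch under stated assumptions: the function name is hypothetical, and it treats "up to 64-bits" as the power-of-two sizes 1, 2, 4 and 8 bytes.

```python
def is_atomic_plain_reference(address, size_bytes):
    """True if a non-locked reference is naturally aligned and at most 64 bits.

    Assumption: only power-of-two sizes of 1, 2, 4 or 8 bytes are considered.
    """
    if size_bytes not in (1, 2, 4, 8):
        return False
    # Natural alignment: the address is a multiple of the datum size.
    return address % size_bytes == 0
```

A reference that fails this test is not covered by the atomicity statement above; the sketch says nothing about how the processor handles it.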
  • Page 513 • All IA-32 read-modify-write or locked instructions have memory fence semantics. All buffered stores are flushed. • IA-32 IN, OUT and serializing operations (as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual) have memory fence semantics.
  • Page 514 Table 10-7. IA-32 Load/Store Sequentiality and Ordering (Continued) IA-32 Memory Write Uncacheable Cacheable Reference Coalescing locked sequential non-sequential non-sequential or read-modify-write fence fence fence operation flush prior stores flush prior stores flush prior stores IN, INS, OUT, OUTS sequential undefined undefined fence flush prior stores...
  • Page 515 Itanium loads and stores by issuing an acquire operation (or mf) before the instruction set transition. ® ® 10.6.10.1.2 Transitions from IA-32 Instruction Set to Intel Itanium Instruction Set • All data dependencies are honored, Itanium loads see the results of all prior Itanium and IA-32 stores.
  • Page 516 Figure 10-1. I/O Port Space Model Virtual Address Space Physical Address Space Memory Mapped I/O Memory Map I/O IA-32/Intel® Itanium® Loads/Stores 64MB Platform I/O Ports IN/OUT I/O Ports 64MB IA-32 IN, OUT Platform Physical I/O Block IA-32/Intel® Itanium® Loads/Stores IOBase In the Itanium System Environment, the virtual location of the 64 MB I/O port space is determined by operating system.
  • Page 517 IA-32 Shift Port{15:2} I/O Port Left Number 12-bits Port{11:0} ® Intel ® Itanium I/O Port Load, Address Store For IA-32 IN and OUT instructions a port’s virtual address is computed as: port_virtual_address = IOBase | (port{15:2}<<12) | port{11:0} This address computation places 4 ports on each 4K page and expands the space to 64MB, with the ports being at a relative offset specified by port{11:0} within each 4K-byte virtual page.
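The port address computation above can be checked with a few lines of Python. The function name is hypothetical, and the IOBase value in the usage note is an arbitrary placeholder, not a value from the manual.

```python
def port_virtual_address(io_base, port):
    """Compute IOBase | (port{15:2} << 12) | port{11:0} for a 16-bit port."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("I/O port numbers are 16 bits")
    page_select = (port >> 2) << 12   # port{15:2} picks one 4K page per 4 ports
    page_offset = port & 0xFFF        # port{11:0} is the offset within the page
    return io_base | page_select | page_offset
```

With a placeholder IOBase of 0, port 0x3F8 maps to 0xFE3F8, ports 0x3F8 through 0x3FB share one 4K page, and the highest port 0xFFFF maps to 0x3FFFFFF, which is the last byte of the 64MB space.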
  • Page 518 Operating System Warning: Operating system code can not remap a given port to another port address within the I/O port space, such that port_physical_address{21:12} != port_physical_address{11:2}. Otherwise, based on the processor model, I/O port data may be placed on the wrong bytes of the processor’s bus and the port will not be correctly accessed.
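The remapping restriction in the warning can be tested mechanically: a port's physical address is acceptable only when bits 21:12 equal bits 11:2. The following sketch uses a hypothetical function name and checks nothing beyond that one bit-field equality.

```python
def port_mapping_is_consistent(port_physical_address):
    """True when port_physical_address{21:12} == port_physical_address{11:2}."""
    bits_21_12 = (port_physical_address >> 12) & 0x3FF
    bits_11_2 = (port_physical_address >> 2) & 0x3FF
    return bits_21_12 == bits_11_2
```

An address such as 0xFE3F8 passes, because both fields hold 0xFE, whereas 0xFF3F8 fails because the page-select bits no longer match the in-page offset bits.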
  • Page 519 10.7.3 IA-32 IN/OUT instructions IA-32 I/O instructions (IN, OUT, INS, OUTS) defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual are augmented as follows: • I/O instructions first check for IOPL permission. If PSR.cpl<=EFLAG.iopl, access permission is granted.
  • Page 520 • If data translations are disabled (PSR.dt is 0) or the referenced I/O port is mapped to an unimplemented virtual address (via the IOBase register), a GPFault is raised on the referencing IA-32 IN, OUT, INS, or OUTS instruction. • Alignment and Data Address breakpoints are also checked and may result in an IA_32_Exception(AlignmentCheck) fault (if PSR.ac is 1) or IA_32_Exception(Debug) trap.
  • Page 521 [mf] //Fence prior memory references, if required add port_addr = IO_Port_Base, Expanded_Port_Number ld.acq data, (port_addr) [mf.a] //Wait for platform acceptance, if required [mf] //Fence future memory references, if required 10.8 Debug Model The debug facilities defined by the Itanium architecture are designed to support debugging of both the Itanium and IA-32 instruction sets.
  • Page 522 10.8.1 Data Breakpoint Register Matching Each Itanium data breakpoint register has the following matching behavior for IA-32 instruction set data memory references: • DBR.addr IA-32 single or multi-byte data memory references that access ANY – memory byte specified by the DBR address and mask fields results in a debug breakpoint trap regardless of datum size and alignment.
  • Page 523 3) record the state of IA-32 execution at the point of interruption. For IA-32 exceptions, ISR contains IA-32 defined error codes and vector numbers as defined by the Intel® 64 and IA-32 Architectures Software Developer’s Manual. IA-32 instruction set related exceptions and software...
  • Page 524 IA_32_Exception (Debug) TrapCode IA-32 debug events. The Trap Code indicates concurrent taken branch, data breakpoint and single step trap conditions. External Interrupt NMI is delivered through the Intel Itanium External Interrupt vector. IA_32_Exception(Break) TrapCode IA-32 INT 3 instruction. IA_32_Exception(INTO) TrapCode IA-32 INTO detected overflow trap.
  • Page 525 IA-32 numeric instructions follow the IA-32 delayed floating-point exception model. Specifically IA-32 numeric exceptions are held pending until the next IA-32 numeric or MMX technology instruction as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual. Numeric faults generated on SSE instructions are reported precisely on the faulting SSE instruction.
  • Page 526 transactions. For IA-32 code, if the platform does not support LOCK or SPLCK, the operating system must disable external bus lock transactions by setting DCR.lc to 1. When DCR.lc is 1, any IA-32 atomic reference not serviced internally in the processor’s caches results in an IA_32_Intercept(Lock) fault.
  • Page 527 Processor Abstraction Layer This chapter defines the architectural requirements for the Processor Abstraction Layer (PAL) for all processors based on the Itanium architecture. It is intended for processor designers, firmware/BIOS designers, system designers, and writers of diagnostic and low level operating system software. PAL is part of the Itanium processor architecture and its goal is to provide a consistent firmware interface to abstract processor implementation-specific features.
  • Page 528 Figure 11-1. Firmware Model Operating System Software UEFI Power mgmt, OS Boot runtime hot-plug, Transfers Instruction services Handoff etc. to OS Execution entrypoints Unified Extensible Firmware Interface (UEFI) procedure calls OS Boot Interrupts, Advanced Selection traps, and Configuration faults System Abstraction Layer and Power Interface (SAL)
  • Page 529 PAL encapsulates those processor functions that are likely to change from implementation to implementation so that SAL firmware and operating system software can maintain a consistent view of the processor. These include non-performance-critical functions such as processor initialization, configuration and error handling.
  • Page 530 11.1.3 PAL Entrypoints The following hardware events can trigger the execution of a PAL entrypoint: • Power-on/reset • Hardware errors (both correctable and uncorrectable) • Initialization event (via external interrupt bus message or processor pin) • Platform management interrupt (via external interrupt bus message or processor pin) These hardware events trigger the execution of one of the following PAL entrypoints (as shown in...
  • Page 531 11.1.5 OS Entrypoints There are several entrypoints from SAL into an operating system (or equivalent software). Entrypoints from SAL into the operating system are expected to meet the following model: • OS_BOOT Operating System Boot interface. – • OS_MCA Operating System Machine Check Abort Handler. –...
  • Page 532 Figure 11-3. Firmware Address Space IA-32 Reset Vector (16 Bytes) 4GB-16 SALE_ENTRY Address (8 Bytes) 4GB-24 Firmware Interface Table Address (8 Bytes) 4GB-32 64 Bytes PAL_A FIT Entry (16 Bytes) 4GB-48 Reserved (16 Bytes) Protected Bootblock) 4GB-64 CPU Reset PALE_RESET Init PALE_INIT PAL_A Block...
  • Page 533 Figure 11-4. Firmware Address Space with Processor-specific PAL_A Components IA-32 Reset Vector (16 Bytes) 4GB-16 SALE_ENTRY Address (8 Bytes) 4GB-24 Firmware Interface Table Address (8 Bytes) 4GB-32 64 Bytes PAL_A FIT Entry (16 Bytes) 4GB-48 Alternate Firmware Interface Table Address (Optional) (8 Bytes) 4GB-56 Reserved (8 Bytes)
  • Page 534 • The 8 bytes at 0xFFFF_FFE0 (4GB-32) contain the physical address of the Firmware Interface Table. • The 16 bytes at 0xFFFF_FFD0 (4GB-48) contain the FIT entry for the PAL_A (or generic PAL_A in the split PAL_A model) code provided by the processor vendor. The format of this FIT entry is described in Figure 11-6.
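The fixed locations at the top of the firmware address space follow directly from their offsets below 4GB. The sketch below tabulates the four entries shown in Figure 11-3; the dictionary and function names are hypothetical.

```python
FOUR_GB = 1 << 32

# name -> (offset below 4GB, size in bytes), as listed in Figure 11-3
FIRMWARE_TOP_LAYOUT = {
    "IA-32 reset vector": (16, 16),
    "SALE_ENTRY address": (24, 8),
    "Firmware Interface Table address": (32, 8),
    "PAL_A FIT entry": (48, 16),
}


def firmware_top_address(name):
    """Physical address of a fixed item at the top of the firmware space."""
    offset, _size = FIRMWARE_TOP_LAYOUT[name]
    return FOUR_GB - offset
```

This reproduces the addresses quoted in the text: 4GB-32 is 0xFFFF_FFE0 and 4GB-48 is 0xFFFF_FFD0.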
  • Page 535 At a minimum, all of the PAL firmware components, pointers at the top of the firmware address space, FIT tables and the portion of the SAL code that is executed at the RECOVERY CHECK hand-off must be accessible from the processor without any special system fabric initialization sequence.
  • Page 536 Figure 11-6. Firmware Interface Table Entry 56 55 32 31 24 23 48 47 Start + 16 Chksum Type Version (2 bytes) Reserved Size (3 bytes) Start + 8 Address (8 bytes) Start of entry • Size A 3-byte field containing the size of the component in bytes divided by 16. –...
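A FIT entry can be unpacked roughly as follows. This is a sketch, not a definitive parser: the function name is hypothetical, little-endian byte order is an assumption, and the bit positions of the second 8-byte word (Size in bits 23:0, Reserved in 31:24, Version in 47:32, Type in 55:48, Chksum in 63:56) are read off the flattened Figure 11-6 above. The Type byte is returned whole, since any sub-fields are not visible in this excerpt.

```python
import struct


def parse_fit_entry(entry):
    """Split a 16-byte FIT entry into its fields (layout assumed, see lead-in)."""
    if len(entry) != 16:
        raise ValueError("a FIT entry is 16 bytes")
    address, word = struct.unpack("<QQ", entry)  # assumed little-endian
    return {
        "address": address,
        # The 3-byte Size field stores the component size divided by 16.
        "size_bytes": (word & 0xFFFFFF) * 16,
        "version": (word >> 32) & 0xFFFF,
        "type": (word >> 48) & 0xFF,
        "checksum": (word >> 56) & 0xFF,
    }
```

The multiplication by 16 is the one step stated outright in the text: a Size field of 0x100 describes a 4096-byte component.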
  • Page 537 11.2 PAL Power On/Reset 11.2.1 PALE_RESET The purpose of PALE_RESET is to initialize and test the processor. Upon receipt of a power-on/reset event the processor begins executing code from the PALE_RESET entrypoint in the firmware address space. PALE_RESET initializes the processor and may perform a minimal processor self test.
  • Page 538 • GR34 contains the physical address for making a PAL procedure call. If the call is for RECOVERY CHECK, only the subset of PAL procedures needed for SALE_ENTRY to perform firmware recovery will be available. These procedures are: • PAL_FREQ_RATIOS •...
  • Page 539 • PSR: PSR.bn is 1; PSR.dfl and PSR.dfh are 1 if the floating-point unit failed self test. All other PSR bits are 0. PSR.ic and PSR.i are zero to ensure external interrupts, NMI and PMI interrupts are disabled. • CRs: The contents of all control registers are undefined except the following: •...
  • Page 540 • status – A function-dependent 8-bit field indicating the firmware status on entry to SALE_ENTRY. If the function value is RESET or RECOVERY_CHECK, the status values are: Table 11-4. status Field Values Status Value Description Normal Normal reset. FIT Header Failure FIT header for FIT and alternate FIT (if supported) is incorrect FIT Checksum Failure...
  • Page 541 Table 11-4. status Field Values (Continued) Status Value Description PAL_B Auth Failure / Good PAL_B found One or more compatible PAL_B's failed authentication and checksum. Another compatible PAL_B was found that passed authentication and checksum. 64K Unaligned No PAL_B was found in the FIT and alternate FIT (if supported) that was correctly aligned to a 64KB boundary.
  • Page 542 • state A 2-bit field indicating the state of the processor after self-test. If SAL – directed PAL to skip some self-tests by modifying the self-test control word, failures related to these self-tests will not be reflected in this state. Table 11-6.
  • Page 543 • test_status An unsigned 32-bit-field providing additional information on test – failures when the state field returns a value of PERFORMANCE RESTRICTED or FUNCTIONALLY RESTRICTED. The value returned is implementation dependent. 11.2.3 PAL Self-test Control Word The PAL self-test control word is a 48-bit value. This bit field is defined in Figure 11-10.
  • Page 544 11.3 Machine Checks 11.3.1 PALE_CHECK When a machine check abort (MCA) occurs, PALE_CHECK is responsible for saving minimal processor state to an uncacheable platform-specific memory location previously registered with PAL via the PAL_MC_REGISTER_MEM procedure. This platform location is called the Minimal State Save Area (min-state save area) and is described in Section 11.3.2.4, “Processor Min-state Save Area Layout”...
  • Page 545 For testing and configuration purposes, it may be necessary for software to intentionally generate a machine check. In this case PALE_CHECK will log the error information, but not attempt recovery before branching to SALE_ENTRY. To allow for this, the PAL_MC_EXPECTED procedure call is defined to indicate that PALE_CHECK should not attempt recovery.
  • Page 546 • GR16 through GR20 (bank 0) contain parameters which PALE_CHECK passes to SALE_ENTRY for diagnostic and recovery purposes: • GR16 contains the address to the first available location in the min-state save area for use by SAL. The address is 8-byte aligned. •...
  • Page 547 • Cache: The processor internal cache is enabled and is unchanged from the time of the MCA except for any lines that were invalidated to correct the error. • TLB: The TCs may be initialized and the TRs are unchanged from the time of the MCA.
  • Page 548 Table 11-7. Processor State Parameter Fields (Continued) Field Bits Description Trap lost. A value of 1 indicates the machine check occurred after an instruction was executed but before a trap that resulted from the instruction execution could be generated. More information. A value of 1 indicates that more error information about the machine check event is available by making the PAL_MC_ERROR_INFO procedure call.
  • Page 549 11.3.2.1.1 Using Processor State Parameter to Determine if Software Recovery of a Machine Check is Possible The us, ci, co, and sy bits in the Processor State Parameter are valid only if the error has not been previously corrected in hardware or firmware (cm bit is 0). Even then, only the bit combinations shown in Table 11-8 are valid.
  • Page 550 After return from the SAL rendezvous call, PALE_CHECK will complete processing the machine check if the rendezvous was successful and then branch to SALE_ENTRY with GR19 set to zero. The processor state when transferring to SAL is as defined in Section 11.3.2, “PALE_CHECK Exit State”...
  • Page 551 area is architectural state needed by the PAL code to resume during MCA and INIT events (architected min-state save area + reserved). The remaining space in the buffer is a scratch space reserved exclusively for PAL use, therefore SAL and OS must not use this area.
  • Page 552 Figure 11-2. Processor State Saved in Min-state Save Area 0xf8 Bank 0 GR31 0xf0 Bank 0 GR30 0xe8 Bank 0 GR29 0xe0 Bank 0 GR28 GR16 0xd8 Bank 0 GR27 0xd0 Bank 0 GR26 0x1c8 0xc8 Bank 0 GR25 0xc0 Bank 0 GR24 0x1c0 XFS or undefined...
  • Page 553 The NaT bits stored in the first entry of the min-state save area have the following layout. Figure 11-3. NaT Bits for Saved GRs 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 NaT bits for Bank 0 GR16 to GR31 NaT bits for GR15 to GR1 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32...
  • Page 554 There are certain error cases that may require returning to a new context in order to recover from the machine check. If this occurs a new context can be returned to via the PAL_MC_RESUME procedure with the new_context flag set. The caller needs to set up the new processor min-state save area as shown in Figure 11-2 for all the listed...
  • Page 555 • If recovery is not supported when PSR.ic=0 then GR24 - GR31 (bank 0) are undefined and their contents have been lost. In this case, recovery is not possible. See Section 11.3.1.1, “Resources Required for Machine Check and Initialization Event Recovery” for details. •...
  • Page 556 • DBR/IBRs: The contents of all breakpoint registers are unchanged from the time of the INIT. • PMCs/PMDs: The contents of the PMC registers are unchanged from the time of the INIT. The contents of the PMD registers are not modified by PAL code, but may be modified if events it is monitoring are encountered.
  • Page 557 Table 11-12. Processor State Parameter Fields (Continued) INIT Field Bits Description value Uncontained storage damage. A value of 1 indicates the error is contained within the CPU and memory hierarchy, but that some memory locations may be corrupt. If us is set to 1, then co and sy will always be cleared to 0.
  • Page 558 Table 11-12. Processor State Parameter Fields (Continued) INIT Field Bits Description value Register file check. A value of 1 indicates that a register file related machine check occurred. See the PAL_MC_ERROR_INFO procedure call for more information. Uarch check. A value of 1 indicates that a micro-architectural related machine check occurred.
  • Page 559 to register its PALE_PMI entrypoint, processor operation is undefined. If a SAL related PMI is seen before the SAL PMI handler is registered, the PAL PMI code will just return to the interrupted context. Figure 11-7. PMI Entrypoints PALE_PMI SALE_PMI The hardware events that can cause the PMI request are referred to as PMI events.
  • Page 560 Table 11-15. PMI Message Vector Assignments Priority Vector Description PAL Reserved High IA-32 Machine Check Rendezvous PAL Reserved 11.5.2 PALE_PMI Exit State The state of the processor on exiting PALE_PMI is: • GRs: The contents of non-banked general registers are unchanged from the time of the interruption.
  • Page 561 • BR0 PAL PMI return address. • ARs: The contents of all application registers are unchanged from the time of the interruption, except the RSE control register (RSC) and the ITC and RUC counters. The RSC.mode field will be set to 0 (enforced lazy mode) while the other fields in the RSC are unchanged.
  • Page 562 Figure 11-8 shows state transitions for the various power states and the software interfaces required for the transitions. Figure 11-8. Power States NORMAL/ LOW-POWER PAL_HALT_LIGHT PAL_HALT Unmasked external Unmasked external interrupts, Machine Interrupts, Machine check, Reset, PMI check, Reset, PMI and INIT and INIT LIGHT HALT...
  • Page 563 implement. It is the responsibility of the caller to ensure cache coherency in this state. • HALT 2 - 7 These are optional implementation-dependent states entered by – calling PAL_HALT with a power state argument in the range of 2-7. Before making this procedure call, the operating system software should first ascertain that the states are implemented by calling PAL_HALT_INFO.
  • Page 564 Figure 11-9. Power and Performance Characteristics for P-states (axes: Power, Performance). P-states can be used by software to implement a demand-based dynamic power management policy that continuously adapts processor performance to the current workload characteristics. This lets software achieve power savings at the system level while still responding quickly to changing workload requirements.
  • Page 565 Figure 11-10. Example of a P-state Transition Policy (diagram labels: Halt, High Utilization, Utilization, Transitions initiated by software). 11.6.1.1 Power Dependency Domains. The concept of P-states applies to each logical processor, and this gives software the required granularity to individually control the power/performance characteristics for each available thread of execution in the system.
  • Page 566 parameters. Each P-state maps to a set of values for the domain parameters, and hence a P-state transition results in a change in the underlying power/performance characteristics for the logical processor. The Itanium architecture supports different types of dependency domains, which enables software to have different degrees of control for P-state changes affecting logical processors in the domain.
  • Page 567 A hardware-independent dependency domain (HIDD) is a self-contained domain that typically means that every logical processor is the only logical processor in that domain, and its domain parameters are individually controllable. Since there are no dependencies with any other logical processors, there is no P-state coordination needed for such domains.
  • Page 568 procedure, and the caller is expected to make another PAL_SET_PSTATE request to transition to the desired P-state. The transition_latency_2 field in the pstate_buffer returned by PAL_PSTATE_INFO indicates the time interval the caller needs to wait to have a reasonable chance of success when initiating another PAL_SET_PSTATE call. Implementation-specific event conditions may prevent a PAL_SET_PSTATE request from being accepted (e.g., due to a thermal protection mechanism), in which case the PAL procedure returns a status of transition failure.
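
A retry loop around a P-state request might look like the following sketch. The names `try_set_pstate` and `retry_wait_seconds` are hypothetical: the first stands in for a wrapper around PAL_SET_PSTATE that reports whether the request was accepted, and the second is assumed to be a wait time the caller has already derived from transition_latency_2 (the units of that field are not stated in this excerpt).

```python
import time
from typing import Callable


def set_pstate_with_retry(target_pstate: int,
                          try_set_pstate: Callable[[int], bool],
                          retry_wait_seconds: float,
                          max_attempts: int = 3,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Request a P-state, waiting between attempts if a request is not accepted.

    try_set_pstate:     hypothetical wrapper around PAL_SET_PSTATE; returns True
                        when the transition request was accepted.
    retry_wait_seconds: assumed to be derived from transition_latency_2.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        if try_set_pstate(target_pstate):
            return True
        if attempt < max_attempts - 1:
            sleep(retry_wait_seconds)  # wait before the next request
    return False
```

Bounding the number of attempts matters because some rejections, such as a thermal protection event, may persist for longer than the suggested wait.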
  • Page 569 initiates a new performance_index count, which is reported when the next PAL_GET_PSTATE procedure call is made. A call to PAL_GET_PSTATE with a type operand of 1 resets the performance measurement logic. SCDD: If the logical processor belongs to a software-coordinated dependency domain, the performance index returned (for either type=0 or 3) corresponds to the target P-state requested by the most recent successful PAL_SET_PSTATE procedure call.
  • Page 570 As seen above, for a HCDD, the PAL_GET_PSTATE procedure allows the caller to get feedback on the dynamic performance of the processor over a software-controlled time period. The caller can use this information to get better system utilization over a subsequent time period by changing the P-state in correlation with the current workload demand.
  • Page 571 For example, let's say the minimum frequency of P0 is 1 GHz and the maximum frequency of P0 is 1.5 GHz. If we are at 1 GHz for a time period of 4, 1.25 GHz for a time period of 16 and 1.5 GHz for a time period of 20, the average performance index is ((100 * 4) + (125 * 16) + (150 * 20)) / (4 + 16 + 20) = 135. The performance_index equation for other P-states can be calculated in a similar manner using their respective frequency index values.
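
The worked example is a time-weighted average, which can be sketched in a few lines. The function name and the `(index, duration)` pair layout are illustrative choices, not part of the PAL interface.

```python
def average_performance_index(samples):
    """Time-weighted average of performance index values.

    samples: iterable of (performance_index, duration) pairs, where duration
             is in any consistent time unit.
    """
    samples = list(samples)
    total_time = sum(duration for _, duration in samples)
    if total_time <= 0:
        raise ValueError("total duration must be positive")
    weighted_sum = sum(index * duration for index, duration in samples)
    return weighted_sum / total_time


# Index 100 for 4 time units, 125 for 16, and 150 for 20, as in the example.
print(average_performance_index([(100, 4), (125, 16), (150, 20)]))  # 135.0
```

The weighted sum is 5400 over a total of 40 time units, which gives the 135 quoted in the text.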
  • Page 572 Figure 11-12. Interaction of P-states with HALT State (plot of Performance against Time across P0 to P3, marking Enter HALT State, Exit HALT State, (Previous) GET, SET(P3), and (Current) GET). As shown above, the value returned for performance_index does not account for the performance during the time spent by the logical processor in the HALT state. This provides for better accuracy in the value reported for performance_index, allowing the caller to make optimal adjustments to the system utilization even in scenarios where we have interactions between P-states and HALT state.
  • Page 573 The VMM is responsible for managing the set of available system resources (CPU, memory, peripherals) and implementing policies to virtualize these resources. In order to support virtual processor operations, the VMM will create a virtual environment and associate logical processors with the virtual environment. A virtual environment consists of one or more logical processors plus the memory resource allocated by the VMM during PAL_VP_INIT_ENV.
  • Page 574 Table 11-16. Virtual Processor Descriptor (VPD) Name Entries Offset Description Class Virtualization Acceleration Control – these control bits enable virtualization acceleration of a particular resource or instruction. See Section 11.7.1.1, “Virtualization Controls” on page 2:329 for details. (Class: Control [always]) Virtualization Disable Control –...
  • Page 575 Table 11-16. Virtual Processor Descriptor (VPD) (Continued) Name Entries Offset Description Class Reserved 1336 Reserved Area – Reserved for future expansion. (Class: Reserved) vpsr 1424 Virtual Processor Status Register – Represents the Processor Status Register of the virtual processor. (Class: Architectural State; Table 11-17)
  • Page 576 Table 11-17. Virtual Processor Descriptor (VPD) – VPSR Field Bits Class User Mask = PSR{5:0} Reserved No accelerations require these fields. System Mask = PSR{23:0} Always a_int, a_from_psr a_from_psr 12:6, 16 Reserved a_from_psr Always PSR.l = PSR{31:0} a_from_psr 31:28 Reserved PSR{63:0} 33:32 No accelerations require these fields.
  • Page 577 Table 11-18. Virtual Processor Descriptor (VPD) – VCR[0-127] Register Name Class VCR0-15 No accelerations require these virtual control registers. VCR16 VIPSR a_from_int_cr, a_to_int_cr VCR17 VISR VCR18 No accelerations require this virtual control register. VCR19 VIIP a_from_int_cr, a_to_int_cr VCR20 VIFA Always VCR21 VITIR Always...
  • Page 578 Table 11-19. Virtualization Acceleration Control (vac) Fields (Continued) Field Description a_to_int_cr Enable the interruption control register (CR16-27) write optimization. See Section 11.7.4.2.3, “Interruption Control Register Write Optimization” on page 2:341 for details. a_from_psr Enable the processor status register read optimization. See Section 11.7.4.2.4, “MOV-from-PSR Optimization”...
  • Page 579 Table 11-20. Virtualization Disable Control (vdc) Fields (Continued) Field Bits Description d_to_pmd Disable PMD write virtualization – If 1, writes to the performance monitor data registers (PMDs) are not virtualized. Code running with PSR.vm==1 can write the performance monitor data registers of the logical processor directly and without handing off to the VMM.
  • Page 580 interruptions except the Virtualization vector. Virtualization vector will be delivered as virtualization intercept in the per-virtual-processor host IVT. See Section 11.7.3, “PAL Intercepts in Virtual Environment” on page 2:332 for details on PAL intercepts. In the virtual environment, the IVA (CR2) control register will be set by PAL virtualization-related procedures and services as summarized in Table 11-21.
  • Page 581 Section 11.7.3.1, “PAL Virtualization Intercept Handoff State” on page 2:333 describes the handoff state of the PAL intercepts. For all interruption vectors other than Virtualization vector, the architectural state at the corresponding IVA-based interruption vector is the same as defined in Chapter 8, “Interruption Vector Descriptions”...
  • Page 582 • IRRs: The contents of IRRs are not changed by PAL. Incoming interruptions may change the contents. • IFS: IFS is unchanged from the time of the interruption. • IIP: Contains the value of IP at the time of the interruption. •...
  • Page 583 Table 11-22. PAL Virtualization Intercept Handoff Cause (GR24) (Continued) Value Cause Description ptc_g Due to ptc.g instruction. ptc_ga Due to ptc.ga instruction. ptr_d Due to ptr.d instruction. ptr_i Due to ptr.i instruction. thash Due to thash instruction. ttag Due to ttag instruction. Due to tpa instruction.
  • Page 584 resource and perform the virtualized operations based on the virtual instance of the resource without handing off to the VMM. Section 11.7.4.2, “Virtualization Accelerations” on page 2:337 describes the supported Virtualization accelerations in the architecture. • Virtualization disables – Virtualization disables optimize the execution of virtualized instructions by disabling virtualization of a particular resource or instruction.
  • Page 585 11.7.4.1.2 Virtualization Cause Optimization Virtualization cause optimization is enabled by the cause bit in the config_options parameter of PAL_VP_INIT_ENV. When enabled, the causes of virtualization intercepts will be provided to the VMM during PAL intercept handoffs within the virtual environment. When disabled, no cause information will be provided during PAL intercept handoffs.
  • Page 586 Table 11-26. Virtualization Accelerations Summary Virtualization Optimization Acceleration Description Control (vac) Virtual External Interrupt Optimization a_int Section 11.7.4.2.1 Interruption Control Register Read Optimization a_from_int_cr Section 11.7.4.2.2 Interruption Control Register Write Optimization a_to_int_cr Section 11.7.4.2.3 MOV-from-PSR Optimization a_from_psr Section 11.7.4.2.4 MOV-from-CPUID Optimization a_from_cpuid Section 11.7.4.2.5 Cover Optimization...
  • Page 587 When this optimization is enabled, execution of rsm and ssm instructions, with PSR.vm==1 and system mask equal to zero (0x0), will not intercept to the VMM unless a fault condition is detected (see Table 11-29 for details). A virtual external interrupt is raised if the virtual highest priority pending interrupt (vhpi) is unmasked by the new vpsr.i and vtpr.
  • Page 588 Table 11-29. Interruptions when Virtual External Interrupt Optimization is Enabled Instructions Interruptions When the virtual external interrupt optimization is enabled, execution rsm, ssm of rsm and ssm instructions with PSR.vm==1 which modify only vpsr.i, may raise the following faults: • Privileged Operation fault – if vpsr.cpl is not zero MOV-from-TPR When the virtual external interrupt optimization is enabled, execution of MOV-from-CR instruction targeting vtpr with PSR.vm==1, may...
  • Page 589 Table 11-31. Interruptions when Interruption Control Register Read Optimization is Enabled Instructions Interruptions Move from interruption control registers When the interruption control register read optimization is enabled, reads of interruption control registers with PSR.vm==1, may raise the following faults: • Illegal Operation fault – if vpsr.ic is not zero or the target operand specifies GR 0 or an out-of-frame stacked register •...
  • Page 590 the virtual processor status register without any intercepts to the VMM; and the last value written to the vpsr will be returned, unless a fault condition is detected (see Table 11-35 for details). The values returned for the mfl, mfh, ac, up and be bits are simply the values of those bits in the PSR of the logical processor, since those bits are not virtualized.
  • Page 591 Table 11-36. Synchronization Requirements for MOV-from-CPUID Optimization VPD Resource Synchronization Required vcpuid0-4 Write Table 11-37. Interruptions when MOV-from-CPUID Optimization is Enabled Instructions Interruptions MOV-from-CPUID When the MOV-from-CPUID optimization is enabled, MOV-from-CPUID instructions with PSR.vm==1 may raise the following faults: •...
  • Page 592 corresponding NaT bits from the VPD. vpsr.bn is updated to reflect the new register bank without any intercepts to the VMM, unless a fault condition is detected (see Table 11-46 for details). If this optimization is disabled, execution of the bsw instruction with PSR.vm==1 results in a virtualization intercept.
  • Page 593 There is no synchronization requirement for the virtualization of probe instructions. 11.7.4.2.9 Test Feature Optimization The test feature optimization is enabled by the a_tf bit in the Virtualization Acceleration Control (vac) field in the VPD. When this optimization is enabled, test feature (tf) instructions running with PSR.vm==1 will test the VCPUID[4] register in the VPD.
  • Page 594 When this optimization is enabled, execution of rsm and ssm instructions, with PSR.vm==1 and system mask equal to zero (0x0), will not intercept to the VMM unless a fault condition is detected (see Table 11-45 for details). When PSR.vm==1, execution of rsm and ssm instructions that modify any bits other than vpsr.ic and the user mask fields will result in virtualization intercepts, whether or not this optimization is enabled.
  • Page 595 Table 11-46. Virtualization Disables Summary (Continued) Virtualization Disable Disable Control Description (vdc) Disable ITM Virtualization d_itm Section 11.7.4.3.6 Disable PSR Interrupt-bit Virtualization d_psr_i Section 11.7.4.3.7 a. The Virtualization Disable Control (vdc) field resides in the Virtual Processor Descriptor (VPD), see Section 11.7.1, “Virtual Processor Descriptor (VPD)”...
  • Page 596 11.7.4.3.4 Disable PMC Virtualization The PMC virtualization disable is controlled by the d_pmc bit in the Virtualization Disable Control (vdc) field in the VPD. When this control is set to 1, accesses (reads/writes) to the performance monitor configuration registers (PMCs) are not virtualized, and code running with PSR.vm==1 can read and write these resources directly without any intercepts to the VMM.
  • Page 597 11.7.4.4 Virtualization Optimization Combinations Table 11-47 describes the supported combinations of virtualization accelerations and disables. Table 11-47. Supported Virtualization Optimization Combinations (disables: d_vmsw, d_extint, d_ibr_dbr, d_pmc, d_to_pmd, d_itm, d_psr_i; accelerations: a_int, a_from_int_cr, a_to_int_cr, a_from_psr, a_from_cpuid, a_cover, a_bsw, a_all_probes, a_select_probes, a_tf, a_ic_um) a. “o” indicates the corresponding virtualization acceleration and disable can be enabled together. b.
  • Page 598 1. Read synchronization – When a specific acceleration is enabled, after interruptions and intercepts that occur when PSR.vm was 1, the VMM must invoke PAL_VPS_SYNC_READ to synchronize the related resources before reading their values from the VPD. 2. Write synchronization – When a specific acceleration is enabled, the VMM must invoke PAL_VPS_SYNC_WRITE to synchronize the related resources after modifying their values in the VPD and before resuming the virtual processor.
  • Page 599 Machine Check (MC) A machine check is a hardware event that indicates that a hardware error or architectural violation has occurred that threatens to damage the architectural state of the machine, possibly causing data corruption. The occurrence of the error triggers the execution of firmware code which records information about the error, and attempts to recover when possible.
  • Page 600 Scratch When applied to either an entrypoint or procedure, scratch means that the contents of the register have no meaning and need not be preserved. Further, the register is available for the storage of local variables. Unless otherwise noted, the register should not be relied upon to contain any particular value after exit.
  • Page 601 • During the execution of PAL procedures to the memory buffer allocated by the caller of the procedure using the memory attribute of the address passed by the caller. • PAL may also issue loads from the architected firmware address space and loads/stores from the registered min-state save area whenever it is executing a PAL procedure or handling PAL-based interruptions (reset, MCA, INIT and PMI).
  • Page 602 Table 11-48. PAL Procedure Index Assignment (Index: Description). 0: Reserved. 1 - 255: Architected procedures; static register calling conventions. 256 - 511: Architected procedures; stacked register calling conventions. 512 - 767: Implementation-specific procedures; static register calling conventions. 768 - 1023: Implementation-specific procedures; stacked register calling conventions. 1024 +: Reserved. The assignment of indices for all architected procedures is controlled by this document.
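
The index ranges in Table 11-48 map directly onto a small lookup, sketched below. The function name and the returned tuple layout are illustrative, and treating index 0 as reserved is an assumption drawn from the unnumbered "Reserved" row that precedes the 1 - 255 range.

```python
def classify_pal_index(index: int):
    """Return (origin, calling_convention) for a PAL procedure index.

    Ranges follow Table 11-48. Index 0 is assumed to be the leading
    'Reserved' row of that table.
    """
    if index < 0:
        raise ValueError("PAL procedure index must be non-negative")
    if 1 <= index <= 255:
        return ("architected", "static")
    if 256 <= index <= 511:
        return ("architected", "stacked")
    if 512 <= index <= 767:
        return ("implementation-specific", "static")
    if 768 <= index <= 1023:
        return ("implementation-specific", "stacked")
    return ("reserved", None)  # index 0 and 1024 and above
```

For instance, PAL_CACHE_FLUSH (index 1) classifies as architected with static registers and PAL_BRAND_INFO (index 274) as architected with stacked registers, matching the calling conventions listed in their specifications.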
  • Page 603 Table 11-49.PAL Cache and Memory Procedures (Continued) Procedure Class Conv. Mode Buffer Description PAL_CACHE_PROT_INFO Req. Static Both Return instruction or data cache protection information. PAL_CACHE_SHARED_INFO Opt. Static Both Returns information on which logical processors share caches. PAL_CACHE_SUMMARY Req. Static Both Return a summary of the cache hierarchy.
  • Page 604 Table 11-50.PAL Processor Identification, Features, and Configuration Procedures Procedure Class Conv. Mode Buffer Description PAL_PROC_SET_FEATURES Req. Static Phys. Enable or disable configurable processor features. PAL_REGISTER_INFO Req. Static Both Return AR and CR register information. PAL_RSE_INFO Req. Static Both Return RSE information. PAL_SET_HW_POLICY Opt.
  • Page 605 a. Calling this procedure may affect resources on multiple processors. Please refer to implementation-specific reference manuals for details. Table 11-53.PAL Processor Self Test Procedures Procedure Class Conv. Mode Buffer Description PAL_CACHE_LINE_INIT Req. Static Phys. Initialize tags and data of a cache line for processor testing.
  • Page 606 Table 11-55.PAL Virtualization Support Procedures (Continued) Procedure Class Conv. Mode Buffer Description PAL_VP_SAVE 271 Opt. Stacked Virt. Dep. Save virtual processor state on the logical processor. PAL_VP_TERMINATE 272 Opt. Stacked Virt. Dep. Terminates operation for the specified virtual processor. 11.10.2 PAL Calling Conventions The following general rules govern the definition of the PAL procedure calling conventions.
  • Page 607 11.10.2.1.3 Making PAL Procedure Calls in Physical or Virtual Mode PAL procedure calls which are made in physical mode must obey the calling conventions described in this chapter, but there are no additional restrictions beyond those noted above. PAL procedure calls made in virtual mode must have the region occupied by PAL_PROC virtually mapped with an ITR.
  • Page 608 Table 11-56. State Requirements for PSR (Continued) PSR Bit Description Entry Exit Class protection key validation enable unchanged data address translation enable unchanged preserved disabled FP register f2 to f31 unchanged disabled FP register f32 to f127 unchanged unchanged secure performance monitors unchanged privileged performance monitor enable unchanged...
  • Page 609 Table 11-57. Definition of Terms Term Description Must be zero at entry to the procedure or on exit from the procedure. If the value at entry is not zero, the procedure may return an illegal argument or execute in an undefined manner. Must be one at entry to the procedure or on exit from the procedure.
  • Page 610 Table 11-58. System Register Conventions (Continued) Name Description Class CMCV Corrected Machine Check Vector unchanged LRR0-LRR1 Local Redirection Registers 0-1 unchanged Region Registers preserved Protection Key Registers preserved Translation Registers unchanged Translation Cache scratch IBR/DBR Break Point Registers preserved Performance Monitor Control Registers preserved Performance Monitor Data Registers unchanged...
  • Page 611 Table 11-60. General Registers – Stacked Calling Conventions (Continued) Register Conventions GR8 - GR11 scratch, procedure return value GR12 special, stack pointer (sp) GR13 special, thread pointer (tp) GR14 - GR27 scratch GR28 input argument, scratch (PAL Index must be passed in GR28) GR29-GR31 scratch Bank 0 Registers...
  • Page 612 Table 11-61. Application Register Conventions Register Description Class BSPSTORE Backing Store Pointer for Memory Stores unchanged RNAT RSE NaT Collection Register unchanged IA-32 Floating-point Control Registers preserved EFLAG IA-32 EFLAG register preserved IA-32 Code Segment Descriptor preserved IA-32 Stack Segment Descriptor preserved CFLG IA-32 Combined CR0 and CR4 Register...
  • Page 613 11.10.3 PAL Procedure Specifications The following pages provide detailed interface specifications for each of the PAL procedures defined in this document. Included in the specification are the input parameters, the output parameters, and any required behavior. Volume 2, Part 1: Processor Abstraction Layer 2:365...
  • Page 614 PAL_BRAND_INFO – Provides Processor Branding Information (274) Purpose: Provides processor branding information. Calling Conv: Stacked Registers. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_BRAND_INFO within the list of PAL procedures. info_request: Unsigned 64-bit integer specifying the information that is being requested (see Table 11-62). address...
  • Page 615 PAL_BUS_GET_FEATURES – Get Processor Bus Dependent Configuration Features (9) Purpose: Provides information about configurable processor bus features. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_BUS_GET_FEATURES within the list of PAL procedures. Reserved Reserved Reserved...
  • Page 616 PAL_BUS_GET_FEATURES Table 11-63. Processor Bus Features Bits Class Control Description Opt. Req. Disable Bus Data Error Checking. When 0, bus data errors are detected and single bit errors are corrected. When 1, no error detection or correction is done. Opt. Req.
  • Page 617 PAL_BUS_SET_FEATURES – Set Processor Bus Dependent Configuration Features (10) Purpose: Enables/disables specific processor bus features. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_BUS_SET_FEATURES within the list of PAL procedures. feature_select: 64-bit vector denoting desired state of each feature (1=select, 0=non-select).
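
Building a 64-bit feature_select vector from a list of bit positions could be sketched as follows. The helper name is hypothetical, and the bit positions in any real call must come from the processor bus features table rather than from this example.

```python
def build_feature_select(selected_bits):
    """Build a 64-bit vector with 1 (select) at each listed bit position.

    All other bits are left at 0 (non-select).
    """
    vector = 0
    for bit in selected_bits:
        if not 0 <= bit <= 63:
            raise ValueError(f"bit position {bit} is outside 0..63")
        vector |= 1 << bit
    return vector
```

Rejecting positions outside 0..63 keeps the result within the 64-bit argument width.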
  • Page 618 PAL_CACHE_FLUSH – Flush Data or Instruction Caches (1) Purpose: Flushes the processor instruction or data caches. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_CACHE_FLUSH within the list of PAL procedures. cache_type: Unsigned 64-bit integer indicating which cache to flush.
  • Page 619 PAL_CACHE_FLUSH throughout the coherence domain. The procedure will perform the necessary serialization and synchronization as required by the architecture. This call does not ensure that data in the processor's coalescing buffers is flushed to memory. See Section 4.4.5, “Coalescing Attribute” on page 2:78 for how to flush the coalescing buffers.
  • Page 620 PAL_CACHE_FLUSH Table 11-66. Cache Line State when inv = 1 Old State New State Comments Invalid Invalid Clean Invalid Modified Invalid Modified data is copied back to memory. The progress_indicator is an unsigned 64-bit integer specifying the starting position of the flush operation.
  • Page 621 PAL_CACHE_FLUSH calling this routine. Alternatively, software can disable the TLBs by setting PSR.it, PSR.dt, and PSR.rt to 0. • The specified caches may also contain PAL firmware code cache entries required to flush the cache. • The specified caches may contain PAL and SAL PMI code if this call was made with PSR.ic = 1 and a PMI interrupt is seen during the execution of the call.
  • Page 622 PAL_CACHE_INFO – Get Detailed Cache Information (2) Purpose: Returns information about a particular processor instruction or data cache at a specified level in the cache hierarchy. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index...
  • Page 623 PAL_CACHE_INFO cache if the cache contents never get flushed to memory (for example an instruction cache). • stride – Unsigned 8-bit integer denoting the binary log of the most effective stride in bytes for flushing the cache. • store_latency – Unsigned 8-bit integer denoting the number of cycles after issue...
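
Because stride is reported as a binary logarithm, the byte stride is `1 << stride`. The sketch below shows one way a caller might step through an address range at that stride. The function name is hypothetical, and aligning the first address down to a stride boundary is an illustrative choice rather than a stated requirement.

```python
def flush_addresses(start: int, length: int, stride_log2: int) -> range:
    """Addresses to touch when flushing [start, start + length) at the
    most effective stride, where stride_log2 is the binary log of the
    stride in bytes.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    stride = 1 << stride_log2           # binary log to bytes
    first = start & ~(stride - 1)       # align down to a stride boundary
    return range(first, start + length, stride)
```

With `stride_log2 = 6` the stride is 64 bytes, so a 0x100-byte range starting at 0x1010 is covered by five addresses from 0x1000 through 0x1100.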
  • Page 624 PAL_CACHE_INIT – Initialize Caches (3) Purpose: Initializes the processor controlled caches. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_CACHE_INIT within the list of PAL procedures. level: Unsigned 64-bit integer containing the level of cache to initialize. If the cache level can be initialized independently, only that level will be initialized.
  • Page 625 PAL_CACHE_LINE_INIT – Initialize a Data Cache Line (31) Purpose: Initializes the tags and data of a data or unified cache line of a processor controlled cache to known values without the availability of backing memory. Calling Conv: Static. Mode: Physical. Buffer: Not dependent...
  • Page 626 PAL_CACHE_PROT_INFO – Get Detailed Cache Protection Information (38) Purpose: Returns protection information about a particular processor instruction or data cache at a specified level in the cache hierarchy. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments:...
  • Page 627 PAL_CACHE_PROT_INFO Figure 11-6. config_info_3 Return Value: bits 31:0 hold cache_protection[4]; bits 63:32 hold cache_protection[5]. Each cache_protection element has the following structure: Figure 11-7.
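
Reading Figure 11-6 as placing cache_protection[4] in the low 32 bits and cache_protection[5] in the high 32 bits, the return value can be split as in this sketch. The function name is hypothetical, and the internal layout of each 32-bit element (Figure 11-7) is not decoded here.

```python
def split_config_info_3(config_info_3: int):
    """Split the 64-bit config_info_3 value into its two 32-bit elements.

    Returns (cache_protection_4, cache_protection_5), assuming element 4
    occupies bits 31:0 and element 5 occupies bits 63:32.
    """
    if not 0 <= config_info_3 < (1 << 64):
        raise ValueError("config_info_3 must be an unsigned 64-bit value")
    cache_protection_4 = config_info_3 & 0xFFFFFFFF
    cache_protection_5 = config_info_3 >> 32
    return cache_protection_4, cache_protection_5
```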
  • Page 628 PAL_CACHE_READ – Read Values from the Processor Cache (259) Purpose: Reads the data and tag of a processor-controlled cache line for diagnostic testing. Calling Conv: Stacked Registers. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_CACHE_READ within the list of PAL procedures. line_id: 8-byte formatted value describing where in the cache to read the data.
  • Page 629 PAL_CACHE_READ Table 11-74. part Input Values Value Description data data protection bits tag protection bits combined protection bits for data and tags a. Note that for this part no data is returned. Only protection bits are returned. All other values of part are reserved. The data return value contains the value read from the cache.
  • Page 630 PAL_CACHE_SHARED_INFO – Get Information on Caches Shared by Logical Processors (43) Purpose: Returns information on caches shared between logical processors. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_CACHE_SHARED_INFO within the list of PAL procedures. cache_level: Unsigned 64-bit integer specifying the level in the cache hierarchy for which information is requested.
  • Page 631 PAL_CACHE_SHARED_INFO Figure 11-9. Layout of proc_n_cache_info1 Return Value (64-bit field layout diagram, bits 63:0) •...
  • Page 632 PAL_CACHE_SUMMARY – Get Cache Hierarchy Summary (4) Purpose: Returns summary information about the hierarchy of caches controlled by the processor. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_CACHE_SUMMARY within the list of PAL procedures. Reserved Reserved Reserved...
  • Page 633 PAL_CACHE_WRITE – Write Values into the Processor Cache (260) Purpose: Writes the data and tag of a processor-controlled cache line for diagnostic testing. Calling Conv: Stacked Registers. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_CACHE_WRITE within the list of PAL procedures. line_id: 8-byte formatted value describing where in the cache to write the data.
  • Page 634 PAL_CACHE_WRITE Table 11-77. part Input Values Value Description data data protection tag protection combined data and tag protection All other values of part are reserved. • mesi – Unsigned 8-bit integer denoting whether the line should be written as clean or dirty, shared or exclusive.
  • Page 635 PAL_CACHE_WRITE To guarantee correct behavior for this procedure, it is required that there shall be no RSE activity that may cause cache side effects.
  • Page 636 PAL_COPY_INFO – Return Parameters to Copy PAL Code to Memory (30) Purpose: Returns the parameters needed to copy relocatable PAL code from the firmware address space to memory. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index...
  • Page 637 PAL_COPY_PAL – Copy PAL Code to Memory (256) Purpose: Copies relocatable PAL code from the firmware address space to memory. Calling Conv: Stacked Registers. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_COPY_PAL within the list of PAL procedures. target_addr: Physical address of a memory buffer to copy relocatable PAL procedures and PAL PMI code.
  • Page 638 PAL_DEBUG_INFO – Get Debug Registers Information (11) Purpose: Returns the number of instruction and data debug register pairs. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_DEBUG_INFO within the list of PAL procedures. Reserved Reserved Reserved...
  • Page 639 PAL_FIXED_ADDR – Get Fixed Geographical Address of Processor (12) Purpose: Returns a unique geographical address of this processor. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_FIXED_ADDR call within the list of PAL procedures. Reserved Reserved Reserved...
  • Page 640 PAL_FREQ_BASE – Get Processor Base Frequency (13) Purpose: Returns the frequency of the output clock for use by the platform, if generated by the processor. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_FREQ_BASE within the list of PAL procedures.
  • Page 641 PAL_FREQ_RATIOS – Get Processor Frequency Ratios (14) Purpose: Returns the ratios of the processor frequency, bus frequency, and interval timer to the input clock of the processor if the platform clock is generated externally, or to the output clock to the platform if the platform clock is generated by the processor. Calling Conv: Static Registers Only. Mode: Physical or Virtual...
  • Page 642 PAL_GET_HW_POLICY – Retrieve Current Hardware Resource Sharing Policy (48) Purpose: Returns the current hardware resource sharing policy of the processor. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Dependent. Arguments: index: Index of PAL_GET_HW_POLICY within the list of PAL procedures. proc_num: Unsigned 64-bit integer that specifies for which logical processor information is being requested.
  • Page 643 PAL_GET_HW_POLICY Table 11-80. Hardware policies returned in cur_policy Value Name Description Performance The processor has its hardware resources configured to achieve maximum performance across all logical processors that share hardware with the logical processor the procedure was made on. Fairness The processor has its hardware resources configured to approximately achieve equal sharing of competing hardware resources among all the logical processors that share hardware...
  • Page 644 PAL_GET_PSTATE – Return Information on the Performance Index of the Processor (262) Purpose: Returns the performance index of the processor. Calling Conv: Stacked Registers. Mode: Physical and Virtual. Buffer: Dependent. Arguments: index: Index of PAL_GET_PSTATE within the list of PAL procedures. type: Type of performance_index value to be returned by this procedure.
  • Page 645 PAL_GET_PSTATE Table 11-81. PAL_GET_PSTATE type Argument type Description The performance_index returned will correspond to the target P-state requested by software. • For SCDD (software-coordinated dependency domain) logical processors, this is the P-state requested by the most recent PAL_SET_PSTATE procedure call made by any logical processor in the domain.
  • Page 646 PAL_GET_PSTATE type=2, the procedure will return the performance_index value corresponding to the processor performance in the time duration between the previous call to PAL_GET_PSTATE with type=1 and the current call. If the processor had transitioned to a HALT state (see Section 11.6.1, “Power/Performance States (P-states)”...
  • Page 647 PAL_HALT – Halt Processor (28) Purpose: Causes the processor to enter the HALT state, or one of the implementation-dependent low-power states. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_HALT within the list of PAL procedures; halt_state: Unsigned 64-bit integer denoting low power state requested.
  • Page 648 PAL_HALT • I/O type is an unsigned 8-bit integer denoting the type of I/O transaction to complete. Table 11-83. I/O Type Definition (columns: Value, Description): No transaction; Perform a load; Perform a store. All other values for I/O type are reserved. •...
  • Page 649 PAL_HALT_INFO – Get Halt State Information for Power Management (257) Purpose: Returns information about the processor’s power management capabilities. Calling Conv: Stacked Registers. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_HALT_INFO within the list of PAL procedures; power_buffer: 64-bit pointer to a 64-byte buffer aligned on an 8-byte boundary.
  • Page 650 PAL_HALT_INFO The latency numbers given are the minimum number of processor cycles that will be required to transition the states. The maximum or average cannot be determined by PAL due to its dependency on outstanding bus transactions. For more information on power management, please refer to Section 11.6, “Power Management”...
  • Page 651 PAL_HALT_LIGHT – Cause Processor to Enter Coherent Halt State (29) Purpose: Causes the processor to enter the LIGHT HALT state, where prefetching and execution are suspended, but cache and TLB coherency is maintained. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent...
  • Page 652 PAL_LOGICAL_TO_PHYSICAL – Get Information on Logical to Physical Processor Mappings (42) Purpose: Returns information on the logical to physical processor mapping. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_LOGICAL_TO_PHYSICAL within the list of PAL procedures; proc_number: Signed 64-bit integer that specifies for which logical processor information is being requested.
  • Page 653 PAL_LOGICAL_TO_PHYSICAL Figure 11-15. Layout of log_overview Return Value [bit-layout figure; fields shown include num_log and ppid] •...
  • Page 654 PAL_LOGICAL_TO_PHYSICAL Figure 11-17. Layout of proc_n_log_info2 Return Value [bit-layout figure] •...
  • Page 655 PAL_MC_CLEAR_LOG – Clear Processor Error Logging Registers (21) Purpose: Clears all processor error logging registers and resets the indicator that allows the error logging registers to be written. This procedure also checks the pending machine check bit and pending INIT bit and reports their states. Calling Conv: Static Registers Only. Mode: Physical and Virtual...
  • Page 656 PAL_MC_DRAIN – Complete Outstanding Transactions (22) Purpose: Ensures that all outstanding transactions in a processor are completed or that any MCA due to these outstanding transactions is taken. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments:...
  • Page 657 PAL_MC_DYNAMIC_STATE – Returns Dynamic Processor State (24) Purpose: Returns the Machine Check Dynamic Processor State. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_MC_DYNAMIC_STATE within the list of PAL procedures; info_type: Unsigned 64-bit value indicating the type of information to return; dy_buffer...
  • Page 658 PAL_MC_ERROR_INFO – Get Processor Error Information (25) Purpose: Returns the Processor Machine Check Information. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_MC_ERROR_INFO within the list of PAL procedures; info_index: Unsigned 64-bit integer identifying the error information that is being requested.
  • Page 659 PAL_MC_ERROR_INFO Table 11-86. info_index Values info_index Error Information Type Description Processor Error Map This info_index value will return the processor error map. This return value specifies the processor core identification, the processor thread identification, and a bit-map indicating which structure(s) of the processor generated the machine check.
  • Page 660 PAL_MC_ERROR_INFO Figure 11-19. level_index Layout [bit-layout figure] Table 11-87. level_index Fields (columns: Field, Bits, Description): Processor core ID (default is 0 for processors with a single core); Logical thread ID (default is 0 for processors that execute a single thread); bits 11:8: Error information is available for 1st, 2nd, 3rd, and 4th level instruction caches...
  • Page 661 PAL_MC_ERROR_INFO Table 11-88. err_type_index Values (Continued) err_type_index Return Value Description value mod 8 Responder identifier The responder identifier is a 64-bit integer that specifies the bus agent that responded to a transaction that was responsible for generating the machine check. The structure-specific error information informs the caller if there is a valid responder identifier.
  • Page 662 PAL_MC_ERROR_INFO instruction pointer available for logging on the second error. If there is, it makes subsequent calls with err_type_index equal to 9, 10, 11, and/or 12 depending on which valid bits are set. The caller continues incrementing the err_type_index value in this fashion until the inc_err_type return value is zero.
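The indexing scheme described above can be sketched in a few lines. This is an illustration only: it assumes that each logged error occupies a block of eight consecutive err_type_index values, which is what the "value mod 8" column of Table 11-88 and the 9 to 12 example suggest. The function names are hypothetical and are not part of PAL.

```python
def split_err_type_index(err_type_index):
    """Split err_type_index into (error_number, info_kind).

    Assumption: info_kind is err_type_index mod 8 (the column used in
    Table 11-88) and error_number counts blocks of eight values.
    """
    if err_type_index < 0:
        raise ValueError("err_type_index must be non-negative")
    return divmod(err_type_index, 8)


def indices_for_error(error_number, info_kinds):
    """Build the err_type_index values for one error and the requested kinds."""
    return [error_number * 8 + kind for kind in info_kinds]
```

Under that assumption, `indices_for_error(1, [1, 2, 3, 4])` reproduces the values 9, 10, 11, and 12 that the text gives for the second error.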
  • Page 663 Reserved. Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the cache_check parameter is valid.
  • Page 664 Reserved. Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the TLB_check parameter is valid.
  • Page 665 Reserved. Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the bus_check parameter is valid.
  • Page 666 Reserved. Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the reg_file_check parameter is valid.
  • Page 667 PAL_MC_ERROR_INFO Table 11-93. reg_file_check Fields Field Bits Description 57:56 Privilege level. The privilege level of the instruction bundle responsible for generating the machine check. The pl field of the reg_file_check parameter is valid. Machine check corrected: This bit is set to one to indicate that the machine check has been corrected.
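As an illustration of how a field such as the privilege level in bits 57:56 could be read out of a 64-bit reg_file_check value, here is a minimal sketch. The helper names are hypothetical; only the bit range 57:56 comes from the table above.

```python
def extract_bits(value, msb, lsb):
    """Return the unsigned field value[msb:lsb], both bounds inclusive."""
    if msb < lsb or lsb < 0:
        raise ValueError("expected msb >= lsb >= 0")
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def privilege_level(reg_file_check):
    """Privilege level field of reg_file_check (bits 57:56 per Table 11-93)."""
    return extract_bits(reg_file_check, 57, 56)
```

The same helper applies to any other field whose bit range is given in these tables, for example bits 11:8 of level_index.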
  • Page 668 Reserved. Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the bus_check parameter is valid.
  • Page 669 PAL_MC_ERROR_INJECT – Inject Processor Error (276) Purpose: Injects the requested processor error or returns information on the supported injection capabilities for this particular processor implementation. Calling Conv: Stacked. Mode: Physical and Virtual. Buffer: Dependent. Arguments: index: Index of PAL_MC_ERROR_INJECT within the list of PAL procedures; err_type_info: Unsigned 64-bit integer specifying the first level error information which identifies the error structure and corresponding structure hierarchy, and the error severity.
  • Page 670 PAL_MC_ERROR_INJECT Table 11-95. err_type_info (columns: Field, Bits, Description). mode: Indicates the mode of operation for this procedure: 0 – Query mode; 1 – Error inject mode (err_inj should also be specified); 2 – Cancel outstanding trigger. All other fields in err_type_info, err_struct_info and err_data_buffer are ignored.
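The three mode values listed above can be captured in a small enumeration. This is a sketch only: the class and function names are hypothetical, the position of the mode field inside err_type_info is not modelled, and treating unlisted values as "UNKNOWN" is an assumption rather than something the excerpt states.

```python
from enum import IntEnum


class InjectMode(IntEnum):
    """Mode values of the err_type_info mode field (Table 11-95)."""
    QUERY = 0
    ERROR_INJECT = 1
    CANCEL_TRIGGER = 2


def describe_mode(value):
    """Return the symbolic name of a mode value, or "UNKNOWN" if unlisted."""
    try:
        return InjectMode(value).name
    except ValueError:
        return "UNKNOWN"
```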
  • Page 671 PAL_MC_ERROR_INJECT supported for error injection. The caller is required to use the query mode with appropriate inputs in err_struct_info to determine which combinations of error injection types are supported. If a given combination is not supported, the procedure returns with status -5. The procedure supports both an Error inject mode and an Error inject and consume mode (selectable through the err_inj field in err_type_info).
  • Page 672 PAL_MC_ERROR_INJECT Table 11-96. resources Return Value (columns: Field, Bits, Description). ibr0: When 1, indicates that IBR0,1 are being used by the procedure for trigger functionality. ibr2: When 1, indicates that IBR2,3 are being used by the procedure for trigger functionality. ibr4: When 1, indicates that IBR4,5 are being used by the procedure for trigger functionality.
  • Page 673 PAL_MC_ERROR_INJECT Table 11-97. err_struct_info – Cache (Continued) Field Bits Description cl_id Indicates which mechanism is used to identify the cache line to be used for error injection: 0 – Reserved 1 – Virtual address provided in the inj_addr field of the buffer pointed to by err_data_buffer should be used to identify the cache line for error injection.
  • Page 674 PAL_MC_ERROR_INJECT Table 11-98. capabilities vector for cache (Continued) Field Bits Description Error injection in tag portion of cache line is supported data Error injection in data portion of cache line is supported mesi Error injection in mesi portion of cache line is supported Error injection that results in data poisoning events is supported Reserved Reserved...
  • Page 675 PAL_MC_ERROR_INJECT Figure 11-30. err_struct_info – TLB [bit-layout figure; fields shown include tr_slot, tc_tr, trigger_pl, and reserved bits]...
  • Page 676 PAL_MC_ERROR_INJECT Figure 11-31. capabilities vector for TLB [bit-layout figure; fields shown include tr, tc, rv, trigger_pl, and reserved bits]...
  • Page 677 PAL_MC_ERROR_INJECT Table 11-103. err_struct_info – Register File Field Bits Description When 1, indicates that the structure information fields (regfile_id, reg_num) are valid and should be used for error injection. When 0, the structure information fields are ignored, and the values of these fields used for error injection are implementation-specific. regfile_id Identifies the register file where the error should be injected: 0 –...
  • Page 678 PAL_MC_ERROR_INJECT Figure 11-34. capabilities Vector for Register File [bit-layout figure; fields shown include regnum, rsvd, pmd, pmc, ibr, dbr, pkr, rr, cr, ar, pr, br, fr, gr_b1, gr_b0, and reserved bits]...
  • Page 679 PAL_MC_ERROR_INJECT Figure 11-36. err_struct_info – Bus/Processor Interconnect [bit-layout figure; only reserved fields are shown] Table 11-106.
  • Page 680 PAL_MC_HW_TRACKING – Query which hardware structures are performing hardware status tracking (51) Purpose: Provide a way to query which hardware structures are performing hardware status tracking for corrected machine check events. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Dependent...
  • Page 681 PAL_MC_HW_TRACKING The convention for the levels in the hw_track field is such that the least significant bit in the field represents the lowest level of the structures hierarchy. For example, bit 0 of the ICT field represents the first level instruction cache. Volume 2, Part 1: Processor Abstraction Layer 2:433...
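The level convention above (least significant bit is the lowest level) can be sketched as follows. The field width of four bits and the function name are assumptions made for illustration; the excerpt does not state how wide each hw_track field is.

```python
def tracked_levels(field_value, width=4):
    """List the hierarchy levels (1 = first level) whose bit is set.

    Bit 0 of the field maps to level 1, bit 1 to level 2, and so on.
    The default width of 4 bits is an assumption, not taken from the manual.
    """
    return [bit + 1 for bit in range(width) if (field_value >> bit) & 1]
```

For example, a field value of 0b0101 would indicate tracking in the first and third levels of that structure.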
  • Page 682 PAL_MC_EXPECTED – Set/Reset Expected Machine Check Indicator (23) Purpose: Informs PALE_CHECK whether a machine check is expected so that PALE_CHECK will not attempt to correct any expected machine checks. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments:...
  • Page 683 PAL_MC_REGISTER_MEM – Register Memory with PAL for Machine Check and Init (27) Purpose: Registers a platform dependent location with PAL to which it can save minimal processor state in the event of a machine check or initialization event. Calling Conv: Static Registers Only. Mode: Physical...
  • Page 684 PAL_MC_RESUME – Restore Minimal Architected State and Return (26) Purpose: Restores the minimal architectural processor state, sets the CMC interrupt if necessary, and resumes execution. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_MC_RESUME within the list of PAL procedures.
  • Page 685 PAL_MEM_ATTRIB – Get Memory Attributes (5) Purpose: Returns the memory attributes implemented by the processor. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_MEM_ATTRIB within the list of PAL procedures; Reserved; Reserved; Reserved...
  • Page 686 PAL_MEMORY_BUFFER – Allocate a cacheable memory buffer for exclusive PAL usage (277) Purpose: Provides cacheable memory to PAL for exclusive use during runtime. Calling Conv: Stacked. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_MEMORY_BUFFER within the list of PAL procedures; base_address: Physical address of the memory buffer allocated for PAL use.
  • Page 687 PAL_MEMORY_BUFFER A memory buffer must be allocated for each physical package, and is shared by all logical processors on that package. Software is required to call this procedure on all logical processors on a given package with the same input values. If not, processor operation is undefined.
  • Page 688 PAL_PERF_MON_INFO – Get Processor Performance Monitor Information (15) Purpose: Returns Performance Monitor information about what can be counted and how to configure the monitors to count the desired events. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments:...
  • Page 689 PAL_PERF_MON_INFO Table 11-111. pm_buffer Layout (Continued). Offset 0x40: 256-bit mask defining which registers can count cycles. Offset 0x60: 256-bit mask defining which registers can count retired bundles.
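A caller that has a copy of pm_buffer in memory might decode the two masks above along these lines. This is a sketch under stated assumptions: the buffer is treated as little-endian, bit n of a mask is taken to correspond to register n, and the constant and function names are hypothetical.

```python
CYCLES_MASK_OFFSET = 0x40   # 256-bit mask: registers that can count cycles
RETIRED_MASK_OFFSET = 0x60  # 256-bit mask: registers that can count retired bundles
MASK_BYTES = 32             # 256 bits


def registers_in_mask(pm_buffer, offset):
    """Return the register numbers whose bit is set in the 256-bit mask.

    Assumes little-endian byte order and that bit n stands for register n.
    """
    raw = bytes(pm_buffer[offset:offset + MASK_BYTES])
    if len(raw) != MASK_BYTES:
        raise ValueError("pm_buffer is too short for a 256-bit mask at this offset")
    mask = int.from_bytes(raw, "little")
    return [n for n in range(MASK_BYTES * 8) if (mask >> n) & 1]
```

Calling `registers_in_mask(buf, CYCLES_MASK_OFFSET)` on a 128-byte buffer returns the list of register numbers flagged as able to count cycles.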
  • Page 690 PAL_PLATFORM_ADDR – Set Processor Interrupt Block Address and I/O Port Space Address (16) Purpose: Specifies the physical address of the processor Interrupt Block and I/O Port Space. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments:...
  • Page 691 PAL_PMI_ENTRYPOINT – Setup SAL PMI Entrypoint in Memory (32) Purpose: Sets the SAL PMI entrypoint in memory. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_PMI_ENTRYPOINT within the list of PAL procedures; SAL_PMI_entry: 256-byte aligned physical address of SAL PMI entrypoint in memory.
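Several arguments in this chapter carry alignment requirements, such as the 256-byte alignment of SAL_PMI_entry above. A caller could check such a requirement before making the call with a helper like the following; the function name is hypothetical and the check is a generic power-of-two test, not something PAL provides.

```python
def is_aligned(address, alignment):
    """Return True if address is a multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return address & (alignment - 1) == 0
```

For instance, `is_aligned(entry, 256)` must hold for a SAL PMI entrypoint address, and `is_aligned(buf, 8)` for the 8-byte aligned buffers used by other procedures.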
  • Page 692 PAL_PREFETCH_VISIBILITY – Make Processor Prefetches Visible (41) Purpose: Used in the architected sequences for memory attribute transitions described in Section 4.4.11, “Memory Attribute Transition” on page 2:88 to transition a page (or set of pages) from one memory attribute to another. Calling Conv: Static Registers Only. Mode: Physical and Virtual...
  • Page 693 PAL_PREFETCH_VISIBILITY This procedure, when used to delete a memory range on-line, will ensure that all of the conditions described in both of the preceding paragraphs regarding transition of virtual memory attributes and physical memory attributes are met. If the processor implementation does not require this procedure call to be made on remote processors in the sequences, this procedure will return a 1 upon successful completion.
  • Page 694 PAL_PROC_GET_FEATURES – Get Processor Dependent Features (17) Purpose: Provides information about configurable processor features. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_PROC_GET_FEATURES within the list of PAL procedures; Reserved; feature_set: Feature set information is being requested for.
  • Page 695 PAL_PROC_GET_FEATURES the feature may optionally be controllable, and No indicates that the feature cannot be controllable. The control field applies only when the feature is available. The sense of the bits is chosen so that for features which are controllable, the default hand-off value at exit from PALE_RESET should be 0.
  • Page 696 PAL_PROC_GET_FEATURES Table 11-112. Processor Features (Continued) Class Control Scope Description Opt. Req. Enable the use of the vmsw instruction. When 0, the vmsw instruction causes a Virtualization fault when executed at the most privileged level. When 1, this bit will enable normal operation of the vmsw instruction. This bit has no effect if virtual machine features are disabled (see bit 40).
  • Page 697 PAL_PROC_GET_FEATURES Table 11-112. Processor Features (Continued) Class Control Scope Description Opt. Opt. Virtual Machine features implemented and enabled. When 1, PSR.vm is implemented and virtual machines features are not disabled. When 0 (features_status) and when the corresponding features_avail bit is 1, virtual machines features are implemented but are disabled.
  • Page 698 PAL_PROC_SET_FEATURES – Set Processor Dependent Features (18) Purpose: Enables/disables specific processor features. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_PROC_SET_FEATURES within the list of PAL procedures; feature_select: 64-bit vector denoting desired state of each feature (1=select, 0=non-select); feature_set: Feature set to apply changes to.
  • Page 699 PAL_PSTATE_INFO – Get Information for Power/Performance States (44) Purpose: Returns information about the P-states supported by the processor. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_PSTATE_INFO within the list of PAL procedures; pstate_buffer: 64-bit pointer to a 256-byte buffer aligned on an 8-byte boundary.
  • Page 700 PAL_PSTATE_INFO performance in the P0 state. For example, if the P1-state has a value of 75, and the next P-state (P2) has a value of 50, it implies that P1 performance is 25% lower than P0 performance, and P2 performance is 50% lower than P0 performance. •...
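The arithmetic in the example above can be written out as a short sketch. It assumes that the P0 state corresponds to a performance value of 100, which is what the 75 and 50 figures imply; the function name is hypothetical.

```python
def percent_below_p0(performance_index, p0_index=100):
    """Percentage by which a P-state's performance is lower than P0.

    Assumes P0 corresponds to p0_index (100 by default), as implied by
    the worked example in the text.
    """
    return (p0_index - performance_index) * 100 / p0_index
```

With these definitions, a value of 75 yields 25.0 and a value of 50 yields 50.0, matching the P1 and P2 figures in the text.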
  • Page 701 PAL_PTCE_INFO – Get PTCE Purge Loop Information (6) Purpose: Returns information required for the architected loop used to purge (initialize) the entire TC. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_PTCE_INFO within the list of PAL procedures.
  • Page 702 PAL_REGISTER_INFO – Return Information about Implemented Processor Registers (39) Purpose: Returns information about implemented Application and Control Registers. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_REGISTER_INFO within the list of PAL procedures; info_request: Unsigned 64-bit integer denoting what register information is requested.
  • Page 703 PAL_RSE_INFO – Get RSE Information (19) Purpose: Returns information about the register stack and RSE for this processor implementation. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_RSE_INFO within the list of PAL procedures; Reserved; Reserved; Reserved...
  • Page 704 PAL_SET_HW_POLICY – Set Current Hardware Resource Sharing Policy (49) Purpose: Sets the current hardware resource sharing policy of the processor. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Dependent. Arguments: index: Index of PAL_SET_HW_POLICY within the list of PAL procedures; policy: Unsigned 64-bit integer specifying the hardware resource sharing policy the caller is setting.
  • Page 705 PAL_SET_HW_POLICY Table 11-116. Processor Hardware Sharing Policies (Continued) Value Name Description High-priority The processor configures hardware resources to provide the logical processor this procedure was called on a greater share of the competing hardware resources. All competing logical processors will get a smaller share of the competing hardware resources. Exclusive High-priority The processor configures hardware resources such that the logical processor this procedure was called on has a greater share of the competing hardware resources.
  • Page 706 PAL_SET_PSTATE – Request Processor to Enter Power/Performance State (263) Purpose: To request a processor transition to a given P-state. Calling Conv: Stacked Registers. Mode: Physical and Virtual. Buffer: Dependent. Arguments: index: Index of PAL_SET_PSTATE within the list of PAL procedures; p_state: Unsigned integer denoting the processor P-state being requested.
  • Page 707 PAL_SET_PSTATE coordination. A subsequent call to PAL_SET_PSTATE on any logical processor in the dependency domain (with a force_pstate argument of zero) reinstates hardware coordination. The force_pstate argument is ignored on SCDD and HIDD logical processors. Calling this procedure on some processor implementations may affect P-states of other processors in the same dependency domain.
  • Page 708 PAL_SHUTDOWN – Shutdown the Processor (45) Purpose: Put the logical processor into a low power state which can be exited only by a reset event. Calling Conv: Static Registers Only. Mode: Physical. Buffer: Dependent. Arguments: index: Index of PAL_SHUTDOWN within the list of PAL procedures; notify_platform: 8-byte aligned physical address pointer providing details on how to optionally notify the platform that the processor is entering a shutdown state.
  • Page 709 PAL_TEST_INFO – Information for Processor Self-test (37) Purpose: Returns the alignment and size requirements needed for the memory buffer passed to the PAL_TEST_PROC procedure as well as information on self-test control words for the processor self-tests. Calling Conv: Static Registers Only. Mode: Physical...
  • Page 710 PAL_TEST_PROC – Perform a Processor Self-test (258) Purpose: Performs the second phase of processor self test. Calling Conv: Stacked Registers. PAL_TEST_PROC may modify some registers marked unchanged in the Stacked Register calling convention. See additional description below. Mode: Physical. Buffer: Not dependent. Arguments:...
  • Page 711 PAL_TEST_PROC • test_phase defines which phase of the processor self-tests is requested to be run. A value of zero runs phase two of the processor self-tests, which are the tests that require external memory to execute correctly. A value of one runs phase one of the processor self-tests.
  • Page 712 PAL_TEST_PROC with the exception of the translation caches, which are evicted as a result of testing. PAL_TEST_PROC is free to invalidate all cache contents. If the caller depends on the contents of the cache, they should be flushed before making this call. PAL_TEST_PROC requires that the RSE is set up properly to handle spills and fills to a valid memory location if the contents of the register stack are needed.
  • Page 713 PAL_VERSION – Get PAL Version Number Information (20) Purpose: Returns PAL version information. Calling Conv: Static Registers Only. Mode: Physical or Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_VERSION within the list of PAL procedures; Reserved; Reserved; Reserved. Returns:...
  • Page 714 PAL_VM_INFO – Get Virtual Memory Information (7) Purpose: Return information about the virtual memory characteristics of the processor implementation. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_VM_INFO within the list of PAL procedures; tc_level: Unsigned 64-bit integer specifying the level in the TLB hierarchy for which information is required.
  • Page 715 PAL_VM_PAGE_SIZE – Get Virtual Memory Page Size Information (34) Purpose: Returns page size information about the virtual memory characteristics of the processor implementation. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_VM_PAGE_SIZE within the list of PAL procedures.
  • Page 716 PAL_VM_SUMMARY – Get Virtual Memory Summary Information (8) Purpose: Returns summary information about the virtual memory characteristics of the processor implementation. Calling Conv: Static Registers Only. Mode: Physical and Virtual. Buffer: Not dependent. Arguments: index: Index of PAL_VM_SUMMARY within the list of PAL procedures; Reserved; Reserved; Reserved...
  • Page 717 PAL_VM_SUMMARY Figure 11-49. Layout of vm_info_2 Return Value [bit-layout figure; fields shown include max_purges, rid_size, impl_va_msb, and reserved bits] •...
  • Page 718 PAL_VM_TR_READ – Read a Translation Register (261) Purpose: Reads a translation register. Calling Conv: Stacked Registers. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_VM_TR_READ within the list of PAL procedures; reg_num: Unsigned 64-bit number denoting which TR to read; tr_type: Unsigned 64-bit number denoting whether to read an ITR (0) or DTR (1).
  • Page 719 PAL_VP_CREATE – PAL Create New Virtual Processor (265) Purpose: Initializes a new vpd for the operation of a new virtual processor in the virtual environment. Calling Conv: Stacked Registers. Mode: Virtual. Buffer: Dependent. Arguments: index: Index of PAL_VP_CREATE within the list of PAL procedures; 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD); host_iva: 64-bit host virtual pointer to the host IVT for the virtual processor...
  • Page 720 PAL_VP_CREATE This procedure returns unimplemented procedure when virtual machine features are disabled. See Section 3.4, “Processor Virtualization” on page 2:44 and “PAL_PROC_GET_FEATURES – Get Processor Dependent Features (17)” on page 2:446 for details.
  • Page 721 PAL_VP_ENV_INFO – PAL Virtual Environment Information (266) Purpose: Returns the parameters needed to enter a virtual environment. Calling Conv: Stacked Registers. Mode: Virtual. Buffer: Dependent. Arguments: index: Index of PAL_VP_ENV_INFO within the list of PAL procedures; Reserved; Reserved; Reserved. Returns:...
  • Page 722 PAL_VP_ENV_INFO Table 11-118. vp_env_info – Virtual Environment Information Parameter Field Description Reserved 31:11 Reserved probe If 1, processor supports interception of probe instructions. See Section 11.7.4.2.8, “Probe Instruction Virtualization” on page 2:344 for details on the usage of this control. If 0, intercept of probe instructions is not supported.
  • Page 723 PAL_VP_EXIT_ENV – PAL Exit Virtual Environment (267) Purpose: Allows a logical processor to exit a virtual environment. Calling Conv: Stacked Registers. Mode: Virtual. Buffer: Dependent. Arguments: index: Index of PAL_VP_EXIT_ENV within the list of PAL procedures; Optional 64-bit host virtual pointer to the IVT when this procedure is done; Reserved; Reserved. Returns:...
  • Page 724 PAL_VP_INFO – PAL Virtual Processor Information (50) Purpose: Returns information about virtual processor features. Calling Conv: Static. Mode: Physical. Buffer: Not dependent. Arguments: index: Index of PAL_VP_INFO within the list of PAL procedures; feature_set: Feature set information is being requested for; vp_buffer: An address to an 8-byte aligned memory buffer (if used).
  • Page 725 PAL_VP_INFO get the vmm_id, although vmm_id is also returned for any other implemented feature sets. For feature_set 0, the vp_buffer argument is ignored.
  • Page 726 PAL_VP_INIT_ENV – PAL Initialize Virtual Environment (268) Purpose: Allows a logical processor to enter a virtual environment. Calling Conv: Stacked Registers. Mode: Virtual. Buffer: Dependent. Arguments: index: Index of PAL_VP_INIT_ENV within the list of PAL procedures; config_options: 64-bit vector of global configuration settings –...
  • Page 727 PAL_VP_INIT_ENV processors in the virtual environment must specify the same value in the config_options parameter during PAL_VP_INIT_ENV, otherwise processor operation is undefined. Table 11-119. config_options – Global Configuration Options Field Description Global initialize If 1, this procedure will initialize the PAL virtual environment buffer for Configuration this virtual environment.
  • Page 728 PAL_VP_INIT_ENV Table 11-119. config_options – Global Configuration Options (Continued) Field Description Global opcode This bit must be set to 1 – opcode information will be provided to the Virtualization VMM during PAL intercepts within the virtual environment. This opcode Optimizations may or may not be guaranteed to be the opcode that triggered the intercept.
  • Page 729 PAL_VP_REGISTER – PAL Register Virtual Processor (269) Purpose: Register a different host IVT and/or a different optional virtualization intercept handler for the virtual processor specified by vpd. Calling Conv: Stacked Registers. Mode: Virtual. Buffer: Dependent. Arguments: index: Index of PAL_VP_REGISTER within the list of PAL procedures; 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD); host_iva...
  • Page 730 PAL_VP_REGISTER • Relocate the host IVT associated with the virtual processor. • Specify a different optional virtualization intercept handler for the virtual processor. This procedure returns unimplemented procedure when virtual machine features are disabled. See Section 3.4, “Processor Virtualization” on page 2:44 and “PAL_PROC_GET_FEATURES –...
  • Page 731 PAL_VP_RESTORE PAL_VP_RESTORE – PAL Restore Virtual Processor (270) Restores virtual processor state for the specified vpd on the logical processor. Purpose: Stacked Registers Calling Conv: Virtual Mode: Dependent Buffer: Arguments: Argument Description index Index of PAL_VP_RESTORE within the list of PAL procedures. 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD.) Reserved Reserved...
  • Page 732 PAL_VP_SAVE – PAL Save Virtual Processor (271) Purpose: Saves virtual processor state for the specified vpd on the logical processor. Calling Conv: Stacked Registers. Mode: Virtual. Buffer: Dependent. Arguments: Argument Description index Index of PAL_VP_SAVE within the list of PAL procedures 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD) Reserved Reserved...
  • Page 733 PAL_VP_TERMINATE – PAL Terminate Virtual Processor (272) Purpose: Terminates operation for the specified virtual processor. Calling Conv: Stacked Registers. Mode: Virtual. Buffer: Dependent. Arguments: Argument Description index Index of PAL_VP_TERMINATE within the list of PAL procedures 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD) Optional 64-bit host virtual pointer to the IVT when this procedure is done Reserved Returns:...
  • Page 734 11.11 PAL Virtualization Services In order to support efficient handling of interruptions when PSR.vm was 1, a set of PAL virtualization services is defined to allow certain high-frequency PAL functions to be performed in a low-latency and low-overhead manner. Upon successful completion of PAL_VP_INIT_ENV, the virtual base address of the PAL virtualization services (VSA) is returned to the VMM.
  • Page 735 Table 11-121. State Requirements for PSR for PAL Virtualization Services PSR Bit Description Value big-endian memory access enable user performance monitor enable alignment check floating-point registers f2-f31 written floating-point registers f32-f127 written interruption state collection enable interrupt enable protection key validation enable data address translation enable disabled FP register f2 to f31 disabled FP register f32 to f127...
  • Page 736 c. Specific PAL services can be invoked with PSR.ic equal to 1 or 0. See the description of specific PAL services for details. d. Most PAL services can be invoked with PSR.bn equal to 1 or 0. e. Specific PAL services must be invoked with PSR.bn equal to 0. See the description of specific PAL services for details.
  • Page 737 PAL_VPS_RESUME_NORMAL – Resume Virtual Processor Normal (0x0000) Purpose: Resumes the current virtual processor. This service is used when vpsr.ic is 1. This service can also be used independent of the state of vpsr.ic if all virtualization accelerations and disables are disabled. Arguments: Argument Description...
  • Page 738 PAL_VPS_RESUME_NORMAL Table 11-122. Virtual Processor Settings in Architectural Resources for PAL_VPS_RESUME_NORMAL and PAL_VPS_RESUME_HANDLER Resource Description External Interrupt Control The external interrupt control registers contain the state of the virtual Registers processor if d_extint in Virtualization Disable Control (vdc) is 1. Otherwise the external interrupt control registers are virtualized by the VMM and contain VMM state.
  • Page 739 PAL_VPS_RESUME_NORMAL Table 11-123. Processor Status Register Settings for Virtual Processor Execution (Continued) Field Bits Description 33:32 Contains the cpl field of the virtual processor. VMM-specific. VMM-specific. Must be 1. VMM-specific. VMM-specific. VMM-specific. VMM-specific. 42:41 Contains the ri field of the virtual processor. Contains the ed bit of the virtual processor.
  • Page 740 PAL_VPS_RESUME_HANDLER – Resume Virtual Processor Handler (0x0400) Purpose: Resumes the current virtual processor. This service is used when vpsr.ic is 0. Arguments: Argument Description GR24 VBR0 GR25 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD) GR26 Virtualization Acceleration Control (vac) field from the VPD specified in GR25 and CFLE setting at the target instruction....
  • Page 741 PAL_VPS_SYNC_READ – Synchronize VPD State for Reads (0x0800) Purpose: Synchronize VPD with the latest implementation-specific virtual architectural state. Arguments: Argument Description GR24 64-bit host virtual return address GR25 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD) GR26 Reserved GR27 Reserved...
  • Page 742 PAL_VPS_SYNC_WRITE – Synchronize VPD State for Writes (0x0c00) Purpose: Synchronize the implementation-specific virtual architectural state with VPD. Arguments: Argument Description GR24 64-bit host virtual return address. GR25 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD). GR26 Reserved GR27 Reserved GR28...
  • Page 743 PAL_VPS_SET_PENDING_INTERRUPT – Register Highest Priority Pending Interrupt (0x1000) Purpose: Register highest priority pending interrupt of the running virtual processor. Arguments: Argument Description GR24 64-bit host virtual return address GR25 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD) GR26 Reserved GR27...
  • Page 744 PAL_VPS_SET_PENDING_INTERRUPT performs the following actions: • Copy the virtual highest priority pending interrupt from the VPD into implementation-specific resources. • Return to VMM by an indirect branch specified in the GR24 parameter.
  • Page 745 PAL_VPS_THASH – Compute Long Format VHPT Entry Address (0x1400) Purpose: Compute a long format VHPT entry address. Arguments: Argument Description GR24 64-bit host virtual return address GR25 64-bit virtual address used to compute the hash entry address GR26 Region register value used to compute the hash entry address GR27 Virtual PTA GR28...
  • Page 746 PAL_VPS_TTAG – Compute Translated Hashed Entry Tag (0x1800) Purpose: Compute the long format translated hashed entry tag. Arguments: Argument Description GR24 64-bit host virtual return address GR25 64-bit virtual address used to compute the hash entry tag GR26 Region register value used to compute the hash entry tag GR27 Reserved GR28...
  • Page 747 PAL_VPS_RESTORE – Fast Restore Virtual Processor State (0x1c00) Purpose: Performs an implementation-specific light-weight restore operation for the specified VPD on the logical processor. Arguments: Argument Description GR24 64-bit host virtual return address GR25 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD) GR26 Skip implicit synchronization GR27...
  • Page 748 PAL_VPS_SAVE – Fast Save Virtual Processor State (0x2000) Purpose: Performs an implementation-specific light-weight save operation for the specified VPD on the logical processor. Arguments: Argument Description GR24 64-bit host virtual return address GR25 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD) GR26 Skip implicit synchronization GR27...
  • Page 749 Part II: System Programmer’s Guide
  • Page 751 About the System Programmer’s Guide Part II: System Programmer’s Guide is intended as a companion section to the information presented in Part I, “System Architecture Guide”. While Part I provides a crisp and concise architectural definition of the Itanium instruction set, Part II provides insight into programming and usage models of the Itanium system architecture.
  • Page 752 Chapter 4, “Context Management” describes how operating systems need to preserve Itanium register contents. In addition to spilling and filling a register’s data value, the Itanium architecture also requires software to preserve control and data speculative state associated with that register, i.e. its NaT bit and ALAT state. This chapter also discusses system architecture mechanisms that allow an operating system to significantly reduce the number of registers that need to be spilled/filled on interruptions, system calls, and context switches.
  • Page 753 This chapter is of interest to platform firmware and operating system developers. Related Documents The following documents are referred to fairly often in this document. For more details on software conventions and platform firmware, please consult these manuals (available at http://developer.intel.com). [SWC] Intel® Itanium®...
  • Page 755 This chapter closes by describing how to correctly update code images to implement self-modifying code, cross-modifying code, and paging of code using programmed I/O. An Overview of Intel® Itanium® Memory Access Instructions The Itanium architecture provides load, store, and semaphore instructions to access memory.
  • Page 756 • Fence semantics combine acquire and release semantics (i.e. the instruction is made visible after all prior orderable instructions and before all subsequent orderable instructions). In the above definitions “prior” and “subsequent” refer to the program-specified order. An “orderable instruction” is an instruction that the memory ordering model can use to establish ordering relationships.
  • Page 757 specific opcode chosen. The xchg instruction always has acquire semantics. These instructions read a value from memory, modify this value using an instruction-specific operation, and then write the modified value back to memory. The read-modify-write sequence is atomic by definition. 2.1.3.1 Considerations for using Semaphores The memory location on which a semaphore instruction operates must obey two...
  • Page 758 Memory Ordering in the Intel® Itanium® Architecture Understanding a system’s memory ordering model is key to writing either user- or...
  • Page 759 In the Itanium architecture, dependencies between operations by a processor have implications for the ordering of those operations at that processor. The discussion in Section 2.2.1.6 on page 2:515 and Section 2.2.1.7 on page 2:516 explores this issue in greater depth. The following sections examine the Itanium ordering model in detail. Section 2.2.1 presents several memory ordering executions to illustrate important behaviors of the model.
  • Page 760 “X” and “Y” indicate any orderable instruction. 2.2.1.2 The Intel® Itanium® Architecture Provides a Relaxed Ordering Model The Itanium memory ordering model is a relaxed model. As a result, the Itanium architecture permits any outcome when executing the code shown in Table 2-1.
  • Page 761 Processor #0 operations M1 and M2 and the Processor #1 operations M3 and M4 from Table 2-1 execution as shown in Table 2-1. Table 2-2. Acquire and Release Semantics Order Intel® Itanium® Memory Operations Processor #0 Processor #1 [x] = 1 // M1 ld.acq...
  • Page 762 The Itanium ordering semantics always allow a processor to make operations that follow a release visible before the release and to make operations that precede an acquire visible after the acquire. Table 2-3. Loads May Pass Stores to Different Locations Processor #0 Processor #1 st.rel...
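The reordering rule stated above can be explored with a small enumeration. The Python sketch below is a deliberately simplified model and not the architecture's formal definition: it assumes that an operation may move past another unless an acquire, a release, a fence (`mf`), or a shared location forbids it, and it ignores store buffers and non-atomic visibility. The two-processor litmus test, the tuple encoding, and all function names are illustrative assumptions.

```python
from itertools import permutations

# Operation encoding (hypothetical): (kind, location, register, value)
# kind is one of "st", "st.rel", "ld", "ld.acq", "mf".

def may_reorder(first, second):
    """True if `second` may become visible before `first`,
    where `first` precedes `second` in program order."""
    k1, k2 = first[0], second[0]
    if "mf" in (k1, k2):
        return False      # a fence orders everything around it
    if k1 == "ld.acq":
        return False      # later operations stay after an acquire
    if k2 == "st.rel":
        return False      # earlier operations stay before a release
    if first[1] == second[1]:
        return False      # simplification: same location keeps program order
    return True

def local_orders(program):
    """All visibility orders of one processor's operations that the model permits."""
    orders = []
    for perm in permutations(range(len(program))):
        legal = all(
            may_reorder(program[j], program[i])
            for a, i in enumerate(perm)
            for j in perm[a + 1:]
            if i > j          # op i overtook op j, which precedes it in program order
        )
        if legal:
            orders.append([program[i] for i in perm])
    return orders

def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(p0, p1):
    """Set of (r1, r2) results reachable for the two programs."""
    seen = set()
    for o0 in local_orders(p0):
        for o1 in local_orders(p1):
            for trace in interleavings(o0, o1):
                mem, regs = {"x": 0, "y": 0}, {}
                for kind, loc, reg, val in trace:
                    if kind.startswith("st"):
                        mem[loc] = val
                    elif kind.startswith("ld"):
                        regs[reg] = mem[loc]
                seen.add((regs["r1"], regs["r2"]))
    return seen

MF = ("mf", None, None, None)
P0 = [("st.rel", "x", None, 1), ("ld.acq", "y", "r1", None)]
P1 = [("st.rel", "y", None, 1), ("ld.acq", "x", "r2", None)]
P0_FENCED = [P0[0], MF, P0[1]]
P1_FENCED = [P1[0], MF, P1[1]]
```

In this model `outcomes(P0, P1)` contains `(0, 0)` because each acquire load may pass the preceding release store to a different location, while `outcomes(P0_FENCED, P1_FENCED)` does not, since the fence keeps the store ahead of the load.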
  • Page 763 This contradicts the postulated outcome r1 = 0 and r2 = 0 and thus the Itanium memory ordering model disallows the r1 = 1 and r2 = 0 outcome. Specifically, if M3 reads 0, then M4, M5, and M6 may not yet be visible but M1 and M2 must be visible. Thus, when M6 becomes visible it must observe x = 1 because M1 is already visible.
  • Page 764 2.2.1.7 Data Dependency Establishes Local Ordering In the Itanium architecture, a dependency (e.g., a later operation reading the value written by an earlier operation) can imply a local ordering relationship between the two operations. This section focuses on dependencies through registers only. Section 2.2.1.6 discusses dependencies and MP ordering.
  • Page 765 The Itanium architecture does not allow the outcome r1 = x and r2 = 0 in this execution either. Unlike the execution in Table 2-6, there is no direct dependency between the values that M3 produces and the values that M4 consumes. However, there is a RAW through register r1 from M3 to C1 and a RAW through register p1 from C1 to M4.
  • Page 766 2.2.1.8 Store Buffers May Satisfy Local Loads In the Itanium memory ordering model, store buffers (or other logically-equivalent structures) may satisfy local read requests from loads or acquire loads even if the stored data is not yet visible to other agents in the coherence domain. Such bypassing must honor any ordering semantics in the memory reference stream.
  • Page 767 to account for both the memory ordering semantics and dependencies. It is important to keep in mind that the observance of a dependency between two operations does not imply an ordering relationship (from the standpoint of the memory ordering model) between the operations as Section 2.2.1.6 describes.
  • Page 768 Like Section 2.2.1.8, the discussion in this section focuses on the outcome r1 = 1, r3 = 1, r2 = 0, and r4 = 0 because it is allowed if and only if store buffers can satisfy local loads. The line of reasoning to show that the outcome r1 = 1, r3 = 1, r2 = 0, and r4 = 0 is not allowed in Table 2-11 is similar to the reasoning used to show that this outcome...
  • Page 769 Table 2-12. Bypassing to a Semaphore Operation
      Processor #0                  | Processor #1
      r5 = 2                        | r6 = 2
      st.rel [x] = 1        // M1   | st.rel [y] = 1        // M4
      xchg r1 = [x], r5     // M2   | xchg r3 = [y], r6     // M5
      r2 = [y]              // M3...
  • Page 770 A store buffer may not provide a local read operation early access to a value written by a semaphore operation. Therefore, the outcome r1 = 1, r3 = 1, r2 = 0, r4 = 0, r5 = 0, and r6 = 0 in the Table 2-13 execution is not allowed.
  • Page 771 The fact that the store to x is a release store implies that, since there is a causal relationship from M1 to M3, M1 must become visible to processor #2 before M3. Table 2-15. Intel® Itanium® Architecture Obeys Causality Processor #0 Processor #1 Processor #2 st.rel [x] = 1 // M1...
  • Page 772 2.2.2 Memory Attributes In addition to the ordering semantics and data dependencies, the memory attributes of the page that is being referenced also influence access ordering and visibility. Using memory attributes allows the Itanium architecture to match the performance and the usage model to the type of device (e.g.
  • Page 773 2.2.3 Understanding Other Ordering Models: Sequential Consistency and IA-32 To provide a point of reference, it is helpful to understand other memory ordering models. These ordering models affect not only the programmer’s view of the system, but also the overall system performance and design. Processors with relaxed memory ordering models may achieve higher performance than those with strict ordering models.
  • Page 774 For example, consider the code shown in Figure 2-3. Figure 2-3. Why a Fence During Context Switches is Required in the Intel® Itanium® Architecture // Process A begins executing on Processor #0... ld.acq...
  • Page 775 2.4.1 Spin Lock Software commonly uses spin locks to guard access to a critical region of code. In these locks, the software “spins” while waiting for a shared lock variable to indicate that the critical region can be safely accessed. Typically, the lock code uses atomic operations such as compare and exchange or fetch and add to update the shared lock variable.
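The spin-lock idea described above can be sketched in Python. This is an illustration only: Python has no hardware compare-and-exchange, so the hypothetical `AtomicWord` class emulates one with a `threading.Lock`, and none of the names below come from the manual. On Itanium the acquire path would typically use a compare-and-exchange with acquire semantics and the release path a release store; that correspondence is noted in comments as an assumption.

```python
import threading
import time

class AtomicWord:
    """Emulates one memory word that supports an atomic compare-and-exchange."""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def load(self):
        return self._value

    def store(self, value):
        with self._guard:
            self._value = value

    def cmpxchg(self, expected, new):
        """Write `new` only if the word equals `expected`; return the old value."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old

class SpinLock:
    def __init__(self):
        self.word = AtomicWord(0)        # 0 = free, 1 = held

    def acquire(self):
        while True:
            # Spin on plain reads first so waiters do not hammer the atomic operation.
            while self.word.load() != 0:
                time.sleep(0)            # yield to other Python threads
            # Assumed counterpart of a compare-and-exchange with acquire semantics.
            if self.word.cmpxchg(0, 1) == 0:
                return

    def release(self):
        # Assumed counterpart of a release store of zero to the lock variable.
        self.word.store(0)

def run_counter_demo(workers, iterations):
    """Each worker increments a shared counter inside the critical region."""
    lock = SpinLock()
    shared = {"count": 0}

    def worker():
        for _ in range(iterations):
            lock.acquire()
            value = shared["count"]      # deliberately non-atomic read-modify-write
            shared["count"] = value + 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

Calling `run_counter_demo(4, 100)` runs four workers through the critical region; because the lock serializes the read-modify-write, no increment is lost.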
  • Page 776 2.4.2 Simple Barrier Synchronization A barrier is a common synchronization primitive used to hold a set of processes at a particular point in the program (the barrier) until all processes reach the location. Once all processes arrive at the barrier, they may all continue to execute. Figure 2-5 shows a sense-reversing barrier synchronization based on the fetchadd instruction from Hennessy and Patterson [HP96].
  • Page 777 indicates the value that release must have before the processor can leave the barrier. The last processor to arrive at the barrier releases the other processors by setting release to the new local_sense value. The mf instruction in Figure 2-5 is necessary only if the programmer wishes to ensure that memory operations performed before the barrier code are visible to memory operations performed by any processor after the barrier code.
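The sense-reversing scheme described above can be sketched as follows. This Python version is a minimal illustration under stated assumptions: the `SenseBarrier` class, its `_fetchadd` helper (a lock-protected stand-in for the fetchadd instruction), and the `run_rounds` driver are hypothetical names, and Python threads do not exhibit the relaxed ordering that makes the `mf` instruction relevant on real hardware.

```python
import threading
import time

class SenseBarrier:
    def __init__(self, total):
        self.total = total               # number of participants
        self.count = 0                   # arrivals in the current round
        self.release = 0                 # sense value that frees the waiters
        self._guard = threading.Lock()   # stands in for fetchadd atomicity

    def _fetchadd(self, increment):
        """Atomically add to count and return the previous value."""
        with self._guard:
            old = self.count
            self.count = old + increment
            return old

    def wait(self, local_sense):
        """Block until all participants arrive; return the caller's new sense."""
        local_sense = 1 - local_sense    # reverse the sense for this round
        arrived = self._fetchadd(1) + 1
        if arrived == self.total:
            # Last arrival: reset the count, then release everyone else.
            self.count = 0
            self.release = local_sense
        else:
            while self.release != local_sense:
                time.sleep(0)            # spin, yielding to other threads
        return local_sense

def run_rounds(workers, rounds):
    """Return the rounds in which a worker left the barrier too early."""
    barrier = SenseBarrier(workers)
    arrivals = [0] * rounds
    guard = threading.Lock()
    violations = []

    def worker():
        sense = 0
        for r in range(rounds):
            with guard:
                arrivals[r] += 1
            sense = barrier.wait(sense)
            if arrivals[r] != workers:   # someone had not arrived yet
                violations.append(r)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

Because each round flips the sense, the same barrier object can be reused immediately: a fast thread that re-enters the barrier cannot be released by the value left over from the previous round.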
  • Page 778 Figure 2-6. Dekker’s Algorithm in a 2-way System // The flag_me variable is zero if we are not in the // synchronization and critical section code and non-zero // otherwise; flag_you is similarly set for the other processor. // This algorithm does not retry access to the // resource if there is contention.
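To see why the ordering between the flag store and the flag load matters in this algorithm, the Python sketch below enumerates every interleaving of two processors that each set their own flag and then read the other's. It is a toy model with assumed names (`set_flag`, `read_other`, `both_enter`); it assumes that a fence sits between the store and the load in the in-order version, and it models the unfenced case simply by letting the load become visible before the store.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def in_order(proc):
    # Store to flag_me becomes visible before the load of flag_you.
    return [(proc, "set_flag"), (proc, "read_other")]

def load_passes_store(proc):
    # Assumed relaxed behavior without a fence: the load is performed first.
    return [(proc, "read_other"), (proc, "set_flag")]

def both_enter(order0, order1):
    """True if some interleaving lets both processors enter the critical section."""
    for trace in interleavings(order0, order1):
        flags = {0: 0, 1: 0}
        entered = {0: False, 1: False}
        for proc, op in trace:
            if op == "set_flag":
                flags[proc] = 1
            else:
                # A processor enters only if it sees the other flag still clear.
                entered[proc] = flags[1 - proc] == 0
        if entered[0] and entered[1]:
            return True
    return False
```

In this model `both_enter(in_order(0), in_order(1))` is false for every interleaving, whereas `both_enter(load_passes_store(0), load_passes_store(1))` is true, which is the failure the fence is there to rule out.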
  • Page 779 Figure 2-7. Lamport’s Algorithm // The proc_id variable holds a unique, non-zero id for the process that // attempts access to the critical section. x and y are the synchronization // variables that indicate who is in the critical section and who is // attempting entry.
  • Page 780 • Programmed I/O for paging of code pages. • DMA for paging of code pages. The next four sections discuss these techniques in greater depth. To illustrate the code sequences for self- and cross-modifying code, the examples in this section use the syntax “st [foo] = new” to represent a group of aligned stores that change the instruction at address foo to the instruction “new”.
  • Page 781 2.5.2 Cross-modifying Code Consider a multi-threaded program for a multiprocessor system that dynamically updates some procedure that any processor in the system may execute. The program maintains several disjoint buffers to hold the new code and requires a processor to execute an IP-relative branch instruction at some address x to reach the code.
  • Page 782 The release store ensures that the code image updates are made visible to the remote processors in the proper order (i.e. new_code is updated before the branch at address x is updated). Using the final three instructions ensures that the remote processors will see the new code the next time they execute the branch at address x.
  • Page 783 Figure 2-10. Updating a Code Image on a Remote Processor
    patch_l_and_r:
        [code] = new_inst   // write new instruction
        fc.i code ;;        // flush new instruction
        sync.i ;;           // sync i stream with store
        // If the local processor must ensure that remote processors see
        // the preceding memory updates before any subsequent memory
        // operations, the following code is also necessary.
  • Page 784 Finally, software may also eliminate the mf or srlz.i instructions if it guarantees that these operations will take place elsewhere (e.g. in the operating system) before the processor attempts to execute the updated code. For example, context switch routines must contain a memory fence (see Section 2.3 on page...
  • Page 785 Interruptions and Serialization This chapter discusses the interruption and serialization model. Although the Itanium architecture is an explicitly parallel architecture, faults and traps are delivered in program order based on IP, and from left-to-right in each instruction group. In other words, faults and traps are reported precisely on the instruction that caused them.
  • Page 786 • When an external or independent agent (I/O device, timer, another processor) requires attention from the processor, an interrupt occurs. There are several types of interrupts. An initialization interrupt occurs when the processor has received an initialization request. A Platform Management Interrupt (PMI) can be generated by the platform to request features such as power management.
  • Page 787 instruction address translation is disabled, the IVA register should contain the physical address of the base of the IVT. Software must further ensure that instruction and memory references from low-level interruption handlers do not generate additional interruptions until enough state has been saved and interruption collection can be re-enabled.
  • Page 788 Debug breakpoints, lower-privilege interception, taken branch and single step trapping are disabled. Current privilege level becomes most privileged. Intel Itanium Instruction set. Handlers execute Intel Itanium instructions. id, da, ia, dd, ed Instruction/data debug, access bit and speculation deferral bits are disabled.
  • Page 789 A processor based on the Itanium architecture provides the following interruption registers for collecting information about the latest interruption or the state of the machine at the time of the interruption: • IPSR – A copy of the processor status register (PSR) at the moment the interruption occurred.
  • Page 790 “Interruption Vector Descriptions” for details. Software can use the instruction bundle information for debug and emulation purposes. No other architectural state is modified when an interruption occurs. Note that only IIP, IPSR, ISR, and IFS are written by all interruptions (assuming PSR.ic is 1 at the time of interruption);...
  • Page 791 For example, assume that GR2 contains the new value for IVA and that PSR.i is 1. To modify the IVA register, software would perform the following code sequence, where the code page is mapped by an instruction translation register or instruction translation is disabled: rsm psr.i // external interrupts disabled upon next instruction...
  • Page 792 A typical lightweight interruption handler can operate completely out of register bank 0. If the bank 0 registers provide sufficient storage for the handler, none of the interrupted context’s register state need be saved to memory, and the handler does not need to use stacked registers.
  • Page 793 4. Allocate a “trap frame” to store the interrupted context’s state on the kernel memory stack, and move the interruption state (IIP, IPSR, IIPA, ISR, IFA, IFS, IIB0-1), the interrupted memory stack pointer and the interrupted predicate registers from the banked registers to the trap frame. 5.
  • Page 794 ssm 0x4000 ;; // Set PSR.i There is no need to explicitly serialize the PSR.i update, unless there is a requirement to force sampling of external interrupts right away. Without the serialization, the PSR.i update will occur at the very latest when the next exception causes an implicit instruction serialization to occur.
  • Page 795 heavyweight interruption handler), we say that a nested interruption has occurred. On a nested interruption (other than a Data Nested TLB fault) only ISR is updated by the hardware. All other interruption registers preserve their pre-interruption contents. With the exception of the Data Nested TLB fault, the Itanium architecture does not support nested interruptions.
  • Page 797 4-1, software is required to use different state preservation methods depending on the type of register. More details on register preservation are provided in the next two sections. Table 4-1. Preserving Intel® Itanium® General and Floating-point Registers Floating-point State Components...
  • Page 798 4.1.1 Preserving General Registers The Itanium general register file is partitioned into two register sets: GR0-31 are termed the static general registers and GR32-127 are termed the stacked general registers. Typically, st8.spill and ld8.fill instructions are used to preserve the static GRs, and the processor’s register stack engine (RSE) automatically preserves the stacked GRs.
  • Page 799 4.1.2 Preserving Floating-point Registers The Itanium architecture encodes a floating-point register’s control speculative state as a special unnormalized floating-point number called NaTVal. As a result, Itanium floating-point registers do not have a NaT bit. The architecture provides the stf.spill and ldf.fill instructions to save and restore floating-point register values and control speculative state.
  • Page 800 In principle, preserved GRs and FRs need not be spilled/filled when entering the kernel. Whatever function is called from the low-level interruption handler or the system call entry point will itself observe the calling conventions and preserve the registers. The only occasion when preserved registers need to be spilled/filled is on a process or thread context switch.
  • Page 801 Automatic preservation offers performance benefits: the register stack may contain only a handful of dirty registers, system call parameters can be passed on the register stack, and, upon return to the interrupted context the loadrs instruction only needs to restore registers that were actually spilled to memory. Since system call rates scale with processor performance, the RSE offers a key method for reducing the kernel’s execution time of a system call.
  • Page 802 two “disabled” bits, PSR.dfl and PSR.dfh, are accessible to the privileged software alone. Setting a “disabled” bit causes a fault into the disabled-fp vector upon first use (read or write) of the corresponding register set. As mentioned earlier, an involuntary kernel entry (e.g. interruption) needs to preserve all scratch floating-point registers.
  • Page 803 never accessible to software during the system call (see Section 4.2.2 for details). This works, because at the system call entry user-code may not have any dependencies on the state of the scratch registers. System Calls Reducing the overhead associated with system calls becomes more important as processor efficiency increases.
  • Page 804 the epc until the switch to the kernel backing store has been completed. Additionally, low-level operating system handlers should not only use IPSR.cpl, but should also check BSPSTORE, to determine whether they are running on the kernel backing store (imagine an external interrupt being delivered on the first instruction after the epc). 4.4.2 break/rfi The break instruction, when issued in the i, f, and m syllables, specifies an arbitrary...
  • Page 805 Context Switching This section discusses context switching at the user and kernel levels. 4.5.1 User-level Context Switching 4.5.1.1 Non-local Control Transfers (setjmp/longjmp) A non-local control transfer such as the C language setjmp()/longjmp() pair requires software to correctly handle the register stack and the RSE. The register stack provides the BSP application register which always contains the backing store address of the current GR32.
  • Page 806 Write RSC with setjmp_rsc. d. Write PFS with setjmp_bsp. 6. Restore setjmp()’s return IP into BR7. 7. Return from longjmp() into setjmp()’s caller using br.ret instruction. 4.5.1.2 User-level Co-routines The following steps need to be taken to execute a voluntary user-level thread switch. 1.
  • Page 807 5. Restore the default control register (DCR) of the inbound context (if the DCR is maintained on a per-process basis). 6. Restore the contents of the protection key registers associated with the inbound context.
  • Page 809 Memory Management This chapter introduces various memory management mechanisms of the Itanium architecture: region register model, protection keys, and the virtual hash page table usage models are described. This chapter also discusses usage of the architecture translation registers and translation caches. Outlines are provided for common TLB and VHPT miss handlers.
  • Page 810 region register; they are not inserted into the TLB. Likewise, when software purges a translation from the processor's TLBs, the VRN bits of the address used for the purge are used only to index the corresponding region register and are not used to find a matching translation.
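The role of the VRN bits described above can be illustrated with a toy model. The Python sketch below assumes a 64-bit virtual address whose top three bits (63:61) select one of eight region registers, and it tags TLB entries with the region ID read from that register rather than with the VRN itself. The `TinyTlb` class, the 16 KB page size, and the dictionary-based storage are illustrative assumptions, not a description of any processor implementation.

```python
VRN_SHIFT = 61                          # bits 63:61 hold the virtual region number
OFFSET_MASK = (1 << VRN_SHIFT) - 1      # bits 60:0 hold the offset within the region

def split_va(va):
    """Return (vrn, region_offset) for a 64-bit virtual address."""
    return va >> VRN_SHIFT, va & OFFSET_MASK

class TinyTlb:
    def __init__(self, region_ids, page_shift=14):
        assert len(region_ids) == 8     # one RID per region register
        self.rr = list(region_ids)
        self.page_shift = page_shift    # assumed 16 KB pages
        self.entries = {}               # (rid, vpn) -> physical page number

    def _tag(self, va):
        # The VRN only indexes the region register; it is not part of the tag.
        vrn, offset = split_va(va)
        return self.rr[vrn], offset >> self.page_shift

    def insert(self, va, ppn):
        self.entries[self._tag(va)] = ppn

    def lookup(self, va):
        return self.entries.get(self._tag(va))

    def purge(self, va):
        self.entries.pop(self._tag(va), None)
```

In this model, if region registers 1 and 2 hold the same RID, a translation inserted through a region 1 address is found, and can be purged, through the region 2 address with the same offset, because the entry is identified by RID and virtual page number alone.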
  • Page 811 In a MAS OS, the RID bits act as an address space identifier or tag. For each process-private region, a unique RID is assigned to that process by the OS. If a process needs multiple process-private regions (e.g. the process requires a private 64-bit address space), the OS assigns multiple unique RIDs for each such region.
  • Page 812 5.1.2 Protection Keys The Itanium architecture provides two mechanisms for applying protection to pages. The first mechanism is the access rights bits associated with each translation. These bits provide privilege level-granular access to a page. The second mechanism is the protection keys.
  • Page 813 running, the OS will insert a valid PKR with the protection key 0xA and the ‘rd’ bit cleared, to allow this process to read from the page. However, the ‘wd’ bit for this PKR will be set when the consumer process is running to prevent it from writing the page. The processor hardware has no notion of which protection keys belong to which process.
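The producer/consumer arrangement described above can be sketched as a small permission check. The Python model below is an assumption-laden illustration: the `Pkr` dataclass mirrors the valid, read-disable, write-disable, and execute-disable bits of a protection key register, `check_access` is a hypothetical helper, and the producer's register contents (both disable bits clear) are assumed rather than taken from the text.

```python
from dataclasses import dataclass

@dataclass
class Pkr:
    key: int
    valid: bool = True
    rd: bool = False    # read disable
    wd: bool = False    # write disable
    xd: bool = False    # execute disable

def check_access(pkrs, page_key, access):
    """Check an access ('r', 'w', or 'x') to a page tagged with page_key.

    Returns 'ok', 'key miss' (no valid register holds the key), or
    'key permission' (a matching register disables this kind of access).
    """
    for pkr in pkrs:
        if pkr.valid and pkr.key == page_key:
            disabled = {"r": pkr.rd, "w": pkr.wd, "x": pkr.xd}[access]
            return "key permission" if disabled else "ok"
    return "key miss"

# Register sets the OS would install while each process runs (illustrative).
producer_pkrs = [Pkr(key=0xA)]               # assumed: may read and write
consumer_pkrs = [Pkr(key=0xA, wd=True)]      # may read, may not write
unrelated_pkrs = [Pkr(key=0xB)]              # holds no register for key 0xA
```

With these register sets, the consumer can read the shared page but a write is refused, and a process without key 0xA is refused any access. Since the hardware has no notion of which keys belong to which process, swapping the register contents on a context switch is what enforces the policy.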
  • Page 814 The TCs are treated as a set associative cache and are not addressable by software. The TC replacement policy is determined by hardware. All processor models implement at least 8 instruction and 8 data TRs, and at least 1 instruction and 1 data TC entry. Software inserts translations into the TLBs via insertion instructions.
  • Page 815 6. Using the general registers from steps 4 and 5, execute the itr.i or itr.d instruction. A data or instruction serialization operation must be performed after the insert (for itr.d or itr.i, respectively) before the inserted translation can be referenced. Software may insert a new translation into a TR slot already occupied by another valid translation.
  • Page 816 The size, associativity, and replacement policy of the TC array are implementation-dependent. With the exception of the forward progress rules defined in Section 4.1.1.2, “Translation Cache (TC)” on page 2:49, software cannot depend on the existence or life-span of a TC translation, as a TC entry may be replaced or invalidated by the hardware at any time.
  • Page 817 A data or instruction serialization operation must be performed after the ptc.l before the translation is guaranteed to be no longer visible to the local data or instruction stream, respectively. The ptc.l instruction does not modify the page tables nor any other memory location, nor does it affect the TLB state of any processor other than the one on which it is executed.
  • Page 818 5.2.2.2.3 ptc.g, ptc.ga The Itanium architecture supports efficient global TLB shootdowns via the ptc.g and ptc.ga instructions. These instructions obviate the need for performing inter-processor interrupts to maintain TLB coherence in a multiprocessor system. A TLB coherence domain is defined as a group of processors in a multiprocessor system which maintain TLB coherence via hardware.
  • Page 819 The ptc.ga variant of the global purge instruction behaves just like the ptc.g variant, but it also removes any ALAT entries which fall into the address range specified by the global shootdown from all remote processors’ ALATs. The ptc.ga variant is intended to be used whenever a translation is remapped to a different physical address to ensure that any stale ALAT entries are invalidated.
  • Page 820 tables, or as a primary page table with collision chains. The long format VHPT is a much better representation for address spaces that are sparsely populated, since the short format VHPT has a linear layout and would consume a large amount of memory.
  • Page 821 5.3.2 Long Format The long format VHPT is organized as a hash table which contains a subset of all translation entries. The long format VHPT entries contain a 8-byte field that is ignored by the VHPT walker and can be used by the operating system to link VHPT entries to software-walkable hash collision chains if it uses the VHPT as its primary page table.
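The use of the ignored field for collision chains can be sketched as follows. This Python model is illustrative only: the hash function is a placeholder (the real hash is implementation-specific), tags are modeled as `(rid, vpn)` tuples, and the class and method names are hypothetical. It assumes the hardware walker examines only the entry at the hashed slot, while software may follow the link field further.

```python
class VhptEntry:
    def __init__(self, tag, translation):
        self.tag = tag                  # identifies the translation held here
        self.translation = translation
        self.link = None                # OS-owned field, ignored by the walker

class LongVhpt:
    def __init__(self, slots):
        self.slots = [None] * slots

    def _hash(self, rid, vpn):
        # Placeholder hash for illustration; not the processor's function.
        return (rid ^ vpn) % len(self.slots)

    def insert(self, rid, vpn, translation):
        entry = VhptEntry((rid, vpn), translation)
        index = self._hash(rid, vpn)
        entry.link = self.slots[index]  # previous head moves onto the chain
        self.slots[index] = entry

    def walker_lookup(self, rid, vpn):
        """Model of the hardware walker: one probe, tag must match."""
        head = self.slots[self._hash(rid, vpn)]
        if head is not None and head.tag == (rid, vpn):
            return head.translation
        return None                     # the walker gives up; a fault follows

    def software_lookup(self, rid, vpn):
        """Model of an OS miss handler walking the collision chain."""
        entry = self.slots[self._hash(rid, vpn)]
        while entry is not None:
            if entry.tag == (rid, vpn):
                return entry.translation
            entry = entry.link
        return None
```

When two translations hash to the same slot, only the most recently inserted one is visible to `walker_lookup`; the other remains reachable through `software_lookup`, which is the situation in which the operating system's miss handler would walk the chain.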
  • Page 822 Since the VHPT walker may abort a walk at any time and raise these faults, software must always be able to handle all TLB faults, even when the VHPT walker is enabled. Upon entry to these fault handlers, the IHA, ITIR, and IFA control registers are initialized by the hardware as follows: •...
  • Page 823 5.4.2 VHPT Translation Vector Processors based on the Itanium architecture do not perform recursive TLB hardware page walks. Since the VHPT is itself a virtually addressed structure, each reference performed by the walker itself goes through the TLBs and may miss. These faults are raised when the VHPT walker is enabled, but the walker misses the TLBs when attempting to service a TLB miss caused by the program.
  • Page 824 For a long format VHPT, additional steps are required to load bytes 16-23 of the VHPT entry and check for the correct tag; see Section 5.4.1 for more details. A separate structure other than the VHPT may be used to back VHPT translations, in which case the handler would not use the thash instruction to generate the address of the translation mapping the VHPT entry corresponding to the original faulting address.
  • Page 825 The processor will not deliver a Data Nested TLB fault when PSR.ic is in-flight; Data Nested TLB faults are only delivered when PSR.ic is 0. If PSR.ic is in-flight, any data references which miss the TLB and trigger a fault will raise a Data TLB fault, and the processor will set ISR.ni to 1.
  • Page 826 Figure 5-2. Subpaging Sub-table Native Page Table 16K PTE 4K PTE 16K PTE 4K PTE 4K PTE 001 1 4K PTE 16K PTE 16K PTE When one of the subdivided pages is referenced and does not have a translation in the TLB, a TLB miss will occur.
  • Page 827 Runtime Support for Control and Data Speculation An Itanium architecture-based operating system needs to handle exceptions generated by control speculative loads (ld.s or ld.sa), data speculative loads (ld.a) and architectural loads (ld) in different ways. Software does not have to worry about control or data speculative loads potentially hitting uncacheable memory with side-effects, since ld.s, ld.sa, and ld.a instructions to non-speculative memory are always deferred by the processor; for details refer to Section 4.4.6, “Speculation Attributes”...
  • Page 828 Details on these three models are discussed in the next three sections as well as in Section 5.5.5, “Deferral of Speculative Load Faults” on page 2:105. 6.1.1 Hardware-only Deferral Hardware-only deferral is configured by setting all speculation deferral bits in the DCR register (dd, da, dr, dx, dk, dp and dm) to 1.
  • Page 829 • ITLB.ed=0 (no control speculative recovery code): The compiler generates recovery code only for ld.sa and ld.a instructions that have speculatively executed uses. Speculation failure of ld.sa and ld.a instructions that have no speculatively executed uses can be recovered by a ld.c instruction, and hence do not require recovery code.
  • Page 830 The following pseudo code outlines the basic steps for an unaligned reference handler: 1. Ensure that only ISR.r is 1, and that ISR.w, ISR.x, and ISR.na are 0. 2. Inspect the ISR.sp and ISR.ed. If both are 1, then defer this control speculative load by setting IPSR.ed and rfi-ing.
  • Page 831 Instruction Emulation and Other Fault Handlers This chapter introduces several common emulation handlers that an Itanium architecture-based operating system must support. A general overview is given for: • Unaligned Reference Handler – emulation of misaligned memory references that the processor hardware cannot handle, or has been configured to fault on. •...
  • Page 832 Unsupported Data Reference Handler Processors based on the Itanium architecture do not support all types of memory references to all memory attributes. In particular: • Semaphore operations to uncacheable memory are not supported. For details consult Section 2.1.3.2, “Behavior of Uncacheable and Misaligned Semaphores” on page 2:509.
  • Page 833 (movl), they encode their immediate in the L and the X slot of the bundle. The Intel Itanium processor does not support the long branch instruction, brl, and requires the operating system to emulate its behavior. When an Itanium processor encounters a brl instruction, it vectors to the Illegal Operation Fault handler, regardless of the branches’...
  • Page 834 specified in the brl.call instruction with the IP of the successor of the brl.call (predication helps here as the Itanium instruction set does not provide an indirect move to branch register instruction). • The handler forms the 60-bit immediate IP-offset for the brl target from the i and imm20 fields from the X syllable of the bundle (the brl instruction) and the imm39 field from the L syllable of the bundle.
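The immediate assembly described above can be sketched in Python. This is a minimal illustration, not the handler itself: the function name `brl_target` is hypothetical, and the bit packing (`i` at bit 59, `imm39` at bits 58:20, `imm20` at bits 19:0, then a shift by 4 to convert bundles to bytes) is an assumption about the encoding that this excerpt does not spell out.

```python
MASK64 = (1 << 64) - 1

def brl_target(ip, i, imm39, imm20):
    """Return the 64-bit brl target from the bundle's immediate fields.

    Assumed packing (not stated in the excerpt): imm60 = i:imm39:imm20,
    counted in 16-byte bundles relative to the IP of the brl bundle.
    """
    imm60 = ((i & 1) << 59) | ((imm39 & ((1 << 39) - 1)) << 20) | (imm20 & 0xFFFFF)
    # Shifting left by 4 turns the bundle count into a byte offset and
    # moves the i bit to bit 63, so 64-bit wraparound acts as sign extension.
    offset = (imm60 << 4) & MASK64
    return (ip + offset) & MASK64
```

For example, `brl_target(0x1000, 0, 0, 1)` yields `0x1010` (one bundle forward), while setting every immediate bit yields a target one bundle before the branch.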
  • Page 835 754-1985 for Binary Floating-point Arithmetic (IEEE-754). It is useful in creating and maintaining floating-point exception handling software by operating system writers. Floating-point Exceptions in the Intel® Itanium® Architecture Floating-point exception handling in the Itanium architecture has two major responsibilities.
  • Page 836 SWA Faults, is limited to the scalar reciprocal and scalar reciprocal square-root approximation instructions and is not implementation dependent. It is required for the correctness of the divide and square root algorithms. 8.1.1.1 SWA Faults The Itanium architecture allows an implementation to raise SWA faults as required. Therefore an implementation-independent operating system must be able to emulate the architectural behavior of all FP instructions that can raise a floating-point exception.
  • Page 837 Inexact. This is a trivial case for the SWA Trap handler, since the result of the second IEEE rounding is identical to the first IEEE rounding. Figure 8-1. Overview of Floating-point Exception Handling in the Intel® Itanium® Architecture...
  • Page 838 input/output register specifiers. 3. From the ISR.code and FPSR trap enable controls, determine if a SWA Trap has occurred, if not go to the last step. 4. Read the first IEEE rounded result from the FR output register. 5. From the opcode and the status field, decode the result range and precision. 6.
  • Page 839 At the application level, a user floating-point exception handler could handle the Itanium floating-point exception directly. This is the traditional operating system approach of providing a signal handler with a pointer to a machine-dependent data structure. It would be more convenient for the application developer if the operating system were to first transform the results to make them IEEE-754 conforming and then present the exception to the user in an abstracted manner.
  • Page 840 8.1.2.3 Denormal/Unnormal Operand Exception (Fault) The exception-enabled response of the Itanium arithmetic instruction to a Denormal/Unnormal Operand exception is to leave the operands unchanged and to set the D bit in the ISR.code field of the ISR register. The operating system kernel, reached via the floating-point fault vector, will then invoke the user floating-point exception handler, if one has been registered.
  • Page 841 Just as for overflow, the actual scaling of the result is not performed by the Itanium architecture. It is typically performed by the IEEE Filter, which is invoked before calling the user floating-point exception handler. 8.1.2.6 Inexact Exception (Trap) The exception-enabled response of an Itanium arithmetic instruction to an Inexact exception is to set the I bit (and possibly the FPA bit) in the ISR.code field of the ISR register and the Inexact flag in the appropriate status field of the FPSR register.
  • Page 843 IA-32 Application Support The Itanium architecture enables Itanium architecture-based operating systems to host IA-32 applications, Itanium architecture-based applications, as well as mixed IA-32/Itanium architecture-based applications. Unless the operating system explicitly intercepts ISA transfers (using the PSR.di), user-level code can transition between the two instruction sets without operating system intervention.
  • Page 844 As mentioned earlier, user-level code can transition from Itanium to IA-32 (or back) instruction sets without operating system intervention. As described in Chapter 6, “IA-32 Application Execution Model in an Intel® Itanium® System Environment” in Volume 1, two instructions are provided for this purpose: br.ia (an Itanium unconditional branch), and JMPE (an IA-32 register indirect and absolute jump).
  • Page 845 IA-32 return address (address of the IA-32 instruction following the JMPE itself) in IA_64 register GR1. 9.1.4 Procedure Calls between Intel® Itanium® and IA-32 Instruction Sets If procedure call linkage is required between Itanium architecture-based and IA-32 subroutines, software needs to perform additional work as described in the next two sections.
  • Page 846 4. Make sure JMPE knows where to return to, e.g. deposit return address for the JMPE on memory stack or pass it in an IA-32 visible register. 5. Setup IA-32 branch target in branch register. 6. Flush register stack, but no other RSE updates. 7.
  • Page 847 11. Ensure memory stack pointer is correctly aligned prior to returning to IA-32 code. 12. br.ia returns to IA-32 caller. IA-32 Architecture Handlers An Itanium architecture-based operating system needs to be prepared to handle exceptions from Itanium architecture-based and IA-32 code. Depending on the exception cause, exception vectors can be: •...
  • Page 848 Table 9-1. IA-32 Vectors that need Itanium® Architecture-based OS Support (Continued) Vector (IVA offset) Exception Name Exception Related To Expected OS Behavior IA-32 Taken Branch trap Debug Relay to debugger. IA-32 Single Step trap Debug Relay to debugger. IA-32 Invalid Opcode fault Bad Opcode Signal application.
  • Page 849 making the reference has completed. Since an IA-32 instruction can make multiple memory references, a single IA-32 instruction may cause multiple data break points to trigger. Details on how this is communicated to software in the interrupt status register (ISR) are given in Section 9.1, “IA-32 Trap Code”...
  • Page 851 Itanium architecture can fully leverage the large set of existing platform infrastructure and I/O devices, compatibility with existing platform infrastructure is provided in the form of direct support for Intel 8259A compatible interrupt controllers and limited support for level sensitive interrupts.
  • Page 852 • From external sources, e.g. external interrupt controllers or intelligent external I/O devices, or • From the processor’s LINT0 or LINT1 pins (typically connected to an Intel 8259A compatible interrupt controller), or • From internal processor sources, e.g. timers or performance monitors, or •...
  • Page 853 the way out of an uninterruptable code section software is not required to serialize the setting of PSR.i either, unless it is of interest to software to be able to take interrupts in the very next instruction group. A code example for this case is given below: rsm i ;;...
  • Page 854 10.4 External Interrupt Delivery The architectural interrupt model in Section 5.8 defines how each interrupt vector cycles through one of four states: • Inactive: there is no interrupt pending on this vector. • Pending: an interrupt has been received by the processor on this vector, but has not been accepted by the processor and has not been acquired by software.
  • Page 855 Software must preserve IIP and IPSR prior to re-enabling PSR.ic and PSR.i which will re-enable taking of exceptions and higher priority external interrupts. d. Issue a srlz.d instruction. This ensures that updated PSR.ic and PSR.i settings are visible, and it also makes sure that the IVR read side effect of masking lower or equal priority interrupts is visible when PSR.i becomes 1.
  • Page 856 10.5.1 Notation Preprocessor macros for function ENTRY and END are used in the examples to reduce duplication of code and reduce document space requirements. #define ENTRY(label) \ .text; \ .align 32;; \ .global label; \ .proc label; \ label:: #define END(label) .endp 10.5.2 TPR and XPTR Usage Example This code will allow certain interrupts to be masked by increasing/decreasing the task...
  • Page 857 10.5.3 EOI Usage Example This example is a typical return from an interrupt service routine to the generic interrupt handler. Interrupts are disabled before returning to the main trap handler in preparation for returning from kernel space. return_from_interrupt: // disable interrupts here rsm 0x4000 // make sure interrupts disabled // interrupt_eoi# clear the sapic/pic interrupt...
  • Page 858 The Interval Time Counter (ITC) gets updated at a fixed relation to the processor clock. The ITM, Interval Timer Match, is used to determine when an interval timer interrupt is generated. When the ITC matches the ITM and the timer is unmasked via ITV then an interrupt will be generated.
  • Page 859 the time-out value. In this case the ITM has to be adjusted in order for the next ITM to be accurate. The following algorithm could be used to adjust the next ITM before returning from the timer interrupt handler. for (;;) { itm_next = itm_next + timeout_delta + (read current ITC - read current ITM);...
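The goal of that adjustment, leaving the handler with an ITM value that is still ahead of the running ITC, can be illustrated with a simplified Python sketch. This is not the manual's loop, whose body is cut off in this excerpt; it is a plainer variant that steps the match value in whole periods, which also keeps the tick grid free of drift. The name `next_itm` and the plain-integer register values are assumptions for illustration, and 64-bit counter wraparound is ignored.

```python
def next_itm(itm, itc, timeout_delta):
    """Return the next match value that lies strictly ahead of the ITC.

    itm           -- match value that triggered the current interrupt
    itc           -- counter value read inside the handler
    timeout_delta -- timer period in ITC ticks (must be positive)
    """
    if timeout_delta <= 0:
        raise ValueError("timeout_delta must be positive")
    itm_next = itm + timeout_delta
    # If the handler ran late, skip the periods that have already elapsed
    # so the new match point is not behind the counter.
    while itm_next <= itc:
        itm_next += timeout_delta
    return itm_next
```

With a period of 100 ticks, a handler that reads ITC = 1250 after a match at 1000 would program 1300 rather than a stale 1100.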
  • Page 860 10.5.9 INTA Example External interrupt controllers that are compatible with the Intel 8259A interrupt controller cannot issue interrupt messages, so the vector number is not available at the time of the interrupt request. When an interrupt is accepted the software must check to see if it came from an external controller by the vector number (via IVR) to see if it is the ExtINT vector.
  • Page 861 // A single byte load from the INTA address should cause // the processor to emit the INTA cycle on the processor // system bus. Any Intel 8259A compatible external interrupt // controller must respond with the actual interrupt // vector number as the data to be loaded.
  • Page 863 I/O Architecture I/O devices can be accessed from Itanium architecture-based programs using regular loads and stores to uncacheable space. While cacheable Itanium memory references may be reordered by the processor, uncacheable I/O references are always presented to the platform in program order. This “sequentiality” of uncacheable references is discussed in Section 2.2.2, “Memory Attributes”...
  • Page 864 The mf.a instruction on the other hand ensures that all prior data memory references made by the processor issuing the mf.a have been “accepted” by the external platform. However by itself the mf.a does not guarantee that all cache coherent agents have observed all prior memory operations.
  • Page 865 As a result of the spreading-out of the I/O ports into individual 4KB pages, Itanium architecture-based operating system code can control IA-32 IN, OUT instruction and IA-32 or Itanium load/store accessibility to blocks of 4 virtual I/O ports using the TLBs. This allows Itanium architecture-based operating systems to securely map devices that inhabit the I/O port space to different Itanium architecture-based device drivers or to user-space Itanium architecture-based applications.
  • Page 866 mask = r19 alloc r13 = ar.pfs, 2, 0, 0, 0 // 2 in, 0 local, 0 out, 0 rot movl base_addr = io_port_base extr.u port_offset = in0, 2, 14 mask = 0xfff port_addr = [base_addr] port_offset = port_offset, 12 in0 = mask, in0 port_offset = port_offset, in0 port_addr = port_addr, port_offset...
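The address arithmetic performed by the routine above (extract port bits 15:2, shift them left by 12, and merge the low 12 bits of the port number) can be restated in Python. This is a sketch under the assumption that the I/O port space base is suitably aligned so that OR and ADD give the same result; the function name `io_port_address` is hypothetical.

```python
def io_port_address(io_port_base, port):
    """Map a 16-bit I/O port number to its address in the I/O port space.

    Each group of 4 consecutive ports lands in its own 4KB page:
    port bits 15:2 select the page, port bits 11:0 give the byte offset.
    """
    port &= 0xFFFF
    page = (port >> 2) << 12   # ports 4n .. 4n+3 share page n
    offset = port & 0xFFF      # low 12 bits are kept as the in-page offset
    return io_port_base | page | offset
```

Ports 0x3F8 through 0x3FB therefore share one 4KB page, while port 0x3FC falls in the next page, which is what lets the TLB grant or deny access in blocks of 4 ports.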
  • Page 867 Performance Monitoring Support Processors based on the Itanium architecture include a minimum of four performance counters which can be programmed to count processor events. These event counts can be used to analyze both hardware and software performance. Performance counters can be configured to generate a counter overflow interrupt. This interrupt can be used for event- or time-based profiling.
  • Page 868 The PAL firmware provides information about the performance monitor registers that are implemented on the processor through the PAL_PERF_MON_INFO PAL call. Information provided by the PAL includes bit masks which indicate which PMC/PMD registers are implemented on this processor model, as well as the implemented number of generic PMC/PMD pairs, and the counter width of the generic counters.
  • Page 869 model-specific processor monitoring capabilities, and is a well-defined isolated and easily replaceable software component. The following operating system services allow a kernel mode device driver to take full advantage of the performance monitors: • Allocation/Free Performance monitors – operating system should delegate management of the performance monitor resources to device driver.
  • Page 871 Section 1.2, “Related Documents” on page 2:505. The PAL layer is developed by Intel Corporation and delivered with the processor. The SAL, UEFI and ACPI firmware are developed by the platform manufacturer and provide a means of supporting value added platform features from different vendors.
  • Page 872 The order of steps within the UEFI/SAL firmware is platform implementation dependent and may vary. In general, the UEFI/SAL firmware selects a Bootstrap processor (BSP) in multiprocessor (MP) configurations early in the boot sequence. Next, UEFI/SAL will find and initialize memory and invoke PAL procedures to conduct additional processor tests to ensure the health of the processors.
  • Page 873 The UEFI Boot Manager displays the list of operating system choices and permits the user to select the operating system for booting. To support this functionality, the OS setup program stores the boot paths of the OS loaders and boot options in non-volatile storage managed by the UEFI firmware.
  • Page 874 Figure 13-2. Control Flow of Boot Process in a Multiprocessor Configuration Power On Optional Update Firmware Recovery? PALE_RESET Do System Reset PAL_RESET SALE_ENTRY SAL_RESET BSP Selection Rendez BSP? Rendezvous_1 Interrupt? Initialization PAL Late Self-test & Memory Test PAL Late Self-test Rendezvous_2 Wake APs for PAL Late Self-test...
  • Page 875 The register stack should be invalidated. This can be done by setting the Register Stack Configuration Register (RSC) to zero followed by a loadrs instruction. Setting the RSC to zero will put the register stack in enforced lazy mode and set the RSC.loadrs, load distance to tear point, to zero.
  • Page 876 Before enabling virtual addressing, the Interruption Instruction Bundle Pointer (IIP) is set to point to a virtual address. This is done so that when the return from interruption instruction (rfi) is executed the instruction fetched will have a virtual address. The rfi will switch modes based on IPSR values which are moved into the PSR.
  • Page 877 GetFeaturesCall: mov r14 = ip // Get the ip of the current bundle movl r28 = PAL_PROC_GET_FEATURES// Index of the PAL procedure movl r4 = AddressOfPALProc;;// Address of the PAL proc entry point ld8 r4 = [r4];;// Read address from local pointer mov b5 = r4 // Move address into a branch register // Compute the return address in a position independent manner...
  • Page 878 movl r4 = AddressOfPALProc;;// Address of the PAL proc entry point ld8 r4 = [r4];;// Read address from local pointer mov b5 = r4 // Move address into a branch register // Make the PAL_HALT_INFO procedure call. PAL_HALT_INFO uses stacked register // convention and parameters are passed with in0-in3 mov r28 = PAL_HALT_INFO;;// Index of the PAL procedure...
  • Page 879 the EfiExitBootServices() procedure. After this call, UEFI boot services may no longer be invoked by the OS. The UEFI runtime services execute in physical mode until the OS invokes the EFISetVirtualAddress() function to switch the UEFI to virtual mode. After this point, the UEFI runtime services may be invoked in virtual mode only.
  • Page 880 In general, if SAL needs to invoke a PAL procedure, it will do so in the same addressing mode in which it was called by the OS (i.e. without changing the PSR.dt, PSR.rt, and PSR.it bits). If a particular PAL procedure can only be invoked in physical mode, SAL will turn off translations and then invoke the PAL procedure.
  • Page 881 Figure 13-3. Correctable Machine Check Code Flow PAL_MC_RESUME OS_MCA SAL_CHECK PAL_CHECK Log Error Interrupt Return to Execution Context Figure 13-4. Uncorrectable Machine Check Code Flow OS_MCA PAL_CHECK SAL_CHECK Correct/Log Error For multiprocessor systems, machine checks are classified as local and global. A global MCA implies a system wide broadcast by hardware of an error condition.
  • Page 882 • Attempt to contain the error by requesting a rendezvous for all processors in the system if needed. • Hand off control to SAL for further processing, such as error logging. • Return processor error log information upon request by SAL. •...
  • Page 883 When an uncorrected machine check event occurs, SAL will invoke the OS_MCA handler. The functionality of this handler is dependent on the OS. At a minimum, it must call a SAL procedure to retrieve the error logging and state information and then call another SAL procedure to release these resources for future error logging and state save.
  • Page 884 Figure 13-5. INIT Flow PAL_INIT INIT Event SAL_INIT Write processor / platform info to save area INIT due to failure to respond to rendezvous interrupt? SAL_MC_RENDEZ Wake up Interrupt OS_INIT procedures valid? OS_INIT Return value from OS Warm boot Return to Interrupted Context SAL implementation-specific...
  • Page 885 13.3.3 PMI Flows Processors based on the Itanium architecture implement the Platform Management Interrupt (PMI) to enable platform developers to provide high level system functions, such as power management and security, in a manner that is transparent not only to the application software but also to the operating system.
  • Page 886 than the performance_index returned by PAL_GET_PSTATE, the caller responds by transitioning the processor to a lower performance P-state, which consumes less power and operates at reduced performance. Figure 13-6. Flowchart Showing P-state Feedback Policy (1) getperfindex = PAL_GET_PSTATE (2) OS computes newpstate index from busy ratio and getperfindex Reset newpstate == getperfindex?
  • Page 887 Code Examples OS Boot Flow Sample Code The sample code given below is an example of setting up operating system register state to prepare the processor for running in virtual mode as described in Section 13.1.2, “Operating System Boot Steps” on page 2:625.
  • Page 888 (p6)br.cond.sptk.few.clr Loader_RRLoop // Disable the VHPT walker and set up the minimum size for it (32K) by writing // to the page table address register (cr.pta) mov r2 = (15<<2) mov cr.pta = r2 // Initialize the protection key registers for kernel mov r2 = (1<<...
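The constant written to cr.pta in the sample above, `(15<<2)`, can be unpacked with a short Python sketch. The field positions used here (ve at bit 0, size at bits 7:2, vf at bit 8, base at bits 63:15) are an assumption about the PTA layout, which this excerpt does not restate; `pta_fields` is a hypothetical helper name.

```python
def pta_fields(pta):
    """Split a PTA register value into its fields (assumed layout)."""
    return {
        "ve": pta & 1,                      # VHPT walker enable
        "size": (pta >> 2) & 0x3F,          # VHPT size is 2**size bytes
        "vf": (pta >> 8) & 1,               # 0 = short format, 1 = long format
        "base": pta & ~((1 << 15) - 1),     # VHPT base virtual address
    }

fields = pta_fields(15 << 2)
vhpt_bytes = 1 << fields["size"]
```

Under that layout the value disables the walker (ve = 0) and selects a 2**15 = 32K table, matching the comment in the sample code.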
  • Page 889 The Translation Insertion Format looks like the following... Below is the register interface to insert entries into the TLB //1) A general register contains an address,attributes,and permissions //2) ITIR: additional info such as protection key page size info //3) IFA: specifies the virtual page number for instruction and data TLB inserts //Registers used: //---------------...
  • Page 890 movl r2 = 0x0 // use vpn 0 cr.ifa = r2 //Setup ITIR (Interruption TLB Insertion Register) movl r3 = ( ( 24 << 2 ) | ( 0 << 8 ) ) // 16 MB cr.itir = r3 //Now setup the general register to use with itr (insert translation //register) movl r10 =( (1 <<...
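The ITIR constant in the sample above, `(24 << 2) | (0 << 8)`, encodes a page size and a protection key. A minimal Python sketch of that encoding follows, assuming the page-size field occupies bits 7:2 and the key starts at bit 8, as the shifts in the sample suggest; the helper names are hypothetical.

```python
def itir_value(page_size_log2, key=0):
    """Build an ITIR value from log2(page size) and a protection key."""
    return ((page_size_log2 & 0x3F) << 2) | (key << 8)

def itir_page_bytes(itir):
    """Return the page size in bytes encoded in an ITIR value."""
    return 1 << ((itir >> 2) & 0x3F)

itir = itir_value(24)               # same constant as in the sample code
page_bytes = itir_page_bytes(itir)
```

A page-size field of 24 gives 2**24 bytes, which is the 16 MB noted in the sample's comment.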
  • Page 892 Intel® Itanium® Architecture Software Developer’s Manual Volume 3: Intel® Itanium® Instruction Set Reference Revision 2.3 May 2010 Document Number: 323207...
  • Page 894 Part 1: Application Architecture Guide ......3:1 1.1.2 Part 2: Optimization Guide for the Intel® Itanium® Architecture ..3:1 Overview of Volume 2: System Architecture.
  • Page 895 Function of getf.sig ............3:143
  • Page 896 Floating-point Class Relations ....... . 3:64
  • Page 897: Relationship Between Instruction Type And Execution Unit Type

    Multimedia ALU Size 1 4-bit+2-bit Opcode Extensions ....3:307 4-14 Multimedia ALU Size 2 4-bit+2-bit Opcode Extensions ....3:307
  • Page 898 Floating-point Arithmetic 1-bit Opcode Extensions ..... 3:358 4-65 Fixed-point Multiply Add and Select Opcode Extensions ....3:358
  • Page 899 Instruction Classes ........3:389
  • Page 900 IA-32 application interface. This volume also describes optimization techniques used to generate high performance software. 1.1.1 Part 1: Application Architecture Guide Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual...
  • Page 901 1.2.1 Part 1: System Architecture Guide Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual...
  • Page 902 Chapter 9, “IA-32 Interruption Vector Descriptions” lists IA-32 exceptions, interrupts and intercepts that can occur during IA-32 instruction set execution in the Itanium System Environment. Chapter 10, “Itanium® Architecture-based Operating System Interaction Model with IA-32 Applications” defines the operation of IA-32 instructions within the Itanium System Environment from the perspective of an Itanium architecture-based operating system.
  • Page 903 Instruction Set Reference This volume is a comprehensive reference to the Itanium instruction set, including instruction format/encoding. Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. Chapter 2, “Instruction Reference”...
  • Page 904 These resources include instructions and registers. Itanium Architecture – The new ISA with 64-bit instruction capabilities, new performance-enhancing features, and support for the IA-32 instruction set. IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
  • Page 905 • Intel® 64 and IA-32 Architectures Software Developer’s Manual – This set of manuals describes the Intel 32-bit architecture. They are available from the Intel Literature Department by calling 1-800-548-4725 and requesting Document Numbers 243190, 243191 and 243192...
  • Page 906 Date of Revision Description Revision Number August 2005 Allow register fields in CR.LID register to be read-only and CR.LID checking on interruption messages by processors optional. See Vol 2, Part I, Ch 5 “Interruptions” and Section 11.2.2 PALE_RESET Exit State for details. Relaxed reserved and ignored fields checking in IA-32 application registers in Vol 1 Ch 6 and Vol 2, Part I, Ch 10.
  • Page 907 Date of Revision Description Revision Number August 2002 Added Predicate Behavior of alloc Instruction Clarification (Section 4.1.2, Part I, Volume 1; Section 2.2, Part I, Volume 3). Added New fc.i Instruction (Section 4.4.6.1, and 4.4.6.2, Part I, Volume 1; Section 4.3.3, 4.4.1, 4.4.5, 4.4.6, 4.4.7, 5.5.2, and 7.1.2, Part I, Volume 2; Section 2.5, 2.5.1, 2.5.2, 2.5.3, and 4.5.2.1, Part II, Volume 2;...
  • Page 908 Date of Revision Description Revision Number Volume 2: Class pr-writers-int clarification (Table A-5). PAL_MC_DRAIN clarification (Section 4.4.6.1). VHPT walk and forward progress change (Section 4.1.1.2). IA-32 IBR/DBR match clarification (Section 7.1.1). ISR figure changes (pp. 8-5, 8-26, 8-33 and 8-36). PAL_CACHE_FLUSH return argument change –...
  • Page 909 Date of Revision Description Revision Number Volume 2: Clarifications regarding “reserved” fields in ITIR (Chapter 3). Instruction and Data translation must be enabled for executing IA-32 instructions (Chapters 3,4 and 10). FCR/FDR mappings, and clarification to the value of PSR.ri after an RFI (Chapters 3 and 4).
  • Page 910 Instruction Reference This chapter describes the function of each Itanium instruction. The pages of this chapter are sorted alphabetically by assembly language mnemonic. Instruction Page Conventions The instruction pages are divided into multiple sections as listed in Table 2-1. The first three sections are present on all instruction pages.
  • Page 911 (64-bits not including the NaT bit) where the notation GR[addr] is used. The syntactical differences between the code found in the Operation section and ANSI C are listed in Table 2-4. Table 2-3. Register File Notation Assembly Indirect Register File C Notation Mnemonic Access...
  • Page 912 Table 2-5. Pervasive Conditions Not Included in Instruction Description Code Condition Action Read of a register outside the current frame. An undefined value is returned (no fault). Access to a banked general register (GR 16 through GR 31). The GR bank specified by PSR.bn is accessed. PSR.ss is set.
  • Page 913 add — Add Format: (qp) add r1 = r2, r3 register_form (qp) add r1 = r2, r3, 1 plus1_form, register_form (qp) add r1 = imm, r3 pseudo-op (qp) adds r1 = imm14, r3 imm14_form (qp) addl r1 = imm22, r3 imm22_form Description: The two source operands (and an optional constant 1) are added and the result placed in GR r1. In the register form the first operand is GR r2;...
  • Page 914 addp4 — Add Pointer Format: (qp) addp4 r1 = r2, r3 register_form (qp) addp4 r1 = imm14, r3 imm14_form Description: The two source operands are added. The upper 32 bits of the result are forced to zero, and then bits {31:30} of GR r3 are copied to bits {62:61} of the result. This result is placed in GR r1...
  • Page 915 alloc — Allocate Stack Frame Format: (qp) alloc r1 = ar.pfs, i, l, o, r Description: A new stack frame is allocated on the general register stack, and the Previous Function State register (PFS) is copied to GR r1. The change of frame size is immediate. The write of GR r1 and subsequent instructions in the same instruction group use the new frame...
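The addp4 description above translates directly into a few lines of Python. This sketch models only the register form on plain integers and ignores NaT propagation and predication; the parameter names `offset` and `pointer` are illustrative, with `pointer` standing for the operand whose bits {31:30} select the region.

```python
def addp4(offset, pointer):
    """Model addp4: 32-bit add, zeroed upper half, region bits copied up."""
    low32 = (offset + pointer) & 0xFFFFFFFF   # upper 32 bits forced to zero
    region = (pointer >> 30) & 0x3            # bits {31:30} of the pointer operand
    return low32 | (region << 61)             # copied to bits {62:61} of the result
```

For a 32-bit pointer 0xC0000000 and an offset of 0x10, the result is 0x60000000C0000010: the low word is the 32-bit sum, and the two top pointer bits reappear at bits 62:61.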
  • Page 916 alloc Operation: // tmp_sof, tmp_sol, tmp_sor are the fields encoded in the instruction tmp_sof = i + l + o; tmp_sol = i + l; tmp_sor = r u>> 3; check_target_register_sof(r , tmp_sof); if (tmp_sof u> 96 || r u> tmp_sof || tmp_sol u> tmp_sof || qp != 0) illegal_operation_fault();...
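The frame-size arithmetic at the start of the Operation code above can be tried out in Python. This is a sketch of the size computation and range checks only; it does not model the qualifying predicate, the register stack engine, or the remaining (truncated) steps, and it raises `ValueError` where the architecture would raise an Illegal Operation fault. The function name is hypothetical.

```python
def alloc_frame(i, l, o, r):
    """Compute (sof, sol, sor) for alloc with i inputs, l locals, o outputs
    and r rotating registers, rejecting illegal frame sizes."""
    sof = i + l + o      # size of frame
    sol = i + l          # size of locals (inputs + locals)
    sor = r >> 3         # rotating region is encoded in units of 8 registers
    if sof > 96 or r > sof or sol > sof:
        raise ValueError("illegal frame: would raise Illegal Operation fault")
    return sof, sol, sor
```

For instance, `alloc_frame(2, 0, 0, 0)` returns `(2, 2, 0)`, and a request for more than 96 stacked registers is rejected.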
  • Page 917 and — Logical And Format: (qp) and r1 = r2, r3 register_form (qp) and r1 = imm8, r3 imm8_form Description: The two source operands are logically ANDed and the result placed in GR r1. In the register_form the first operand is GR r2; in the imm8_form the first operand is taken from the imm8 encoding field...
  • Page 918 andcm — And Complement Format: (qp) andcm r1 = r2, r3 register_form (qp) andcm r1 = imm8, r3 imm8_form Description: The first source operand is logically ANDed with the 1’s complement of the second source operand and the result placed in GR r1. In the register_form the first operand is GR r2;...
  • Page 919 br — Branch ) br. ip_relative_form Format: btype dh target ) br. call_form, ip_relative_form btype dh b target counted_form, ip_relative_form btype dh target pseudo-op dh target ) br. indirect_form btype dh b ) br. call_form, indirect_form btype dh b pseudo-op dh b A branch condition is evaluated, and either a branch is taken, or execution continues Description:...
  • Page 920 the branch condition is simply the value of the specified predicate register. These basic branch types are: • cond: If the qualifying predicate is 1, the branch is taken. Otherwise it is not taken. • call: If the qualifying predicate is 1, the branch is taken and several other actions occur: •...
  • Page 921 group as br.ia are not allowed, since br.ia may implicitly read all ARs. If an illegal RAW dependency is present between an AR write and br.ia, the first IA-32 instruction fetch and execution may or may not see the updated AR value. IA-32 instruction set execution leaves the contents of the ALAT undefined.
  • Page 922 The modulo-scheduled loop types are: • ctop and cexit: These branch types behave identically, except in the determination of whether to branch or not. For br.ctop, the branch is taken if either LC is non-zero or EC is greater than one. For br.cexit, the opposite is true. It is not taken if either LC is non-zero or EC is greater than one and is taken otherwise.
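The ctop/cexit branch decision described above can be modelled in Python. The taken condition follows the text directly; the counter and PR[63] updates (decrement LC while it is non-zero, otherwise decrement EC while it is non-zero, with PR[63] set to 1 only in the first case) are an assumption about the parts of the operation not shown in this excerpt, and register rotation is not modelled. All function names are hypothetical.

```python
def br_ctop(lc, ec):
    """Return (taken, new_lc, new_ec, pr63) for one br.ctop execution."""
    taken = lc != 0 or ec > 1
    if lc != 0:
        lc, pr63 = lc - 1, 1      # kernel/prolog: another iteration starts
    elif ec != 0:
        ec, pr63 = ec - 1, 0      # epilog: drain one pipeline stage
    else:
        pr63 = 0
    return taken, lc, ec, pr63

def br_cexit(lc, ec):
    """br.cexit updates state identically but inverts the branch decision."""
    taken, lc, ec, pr63 = br_ctop(lc, ec)
    return (not taken), lc, ec, pr63

def ctop_executions(lc, ec):
    """Count br.ctop executions until the loop falls through."""
    count = 0
    while True:
        taken, lc, ec, _ = br_ctop(lc, ec)
        count += 1
        if not taken:
            return count
```

Starting from LC = 2 and EC = 2, this model takes the branch three times and falls through on the fourth execution, once the loop count is exhausted and the epilog has drained.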
  • Page 923 Figure 2-4. Operation of br.wtop and br.wexit wtop, wexit ==0 (Prolog / Epilog) (Special PR[qp]? Unrolled Loops) > 1 == 0 == 1 (Prolog / Kernel) (Prolog / Epilog) == 1 (Epilog) EC-- EC-- EC = EC EC = EC PR[63] = 0 PR[63] = 0 PR[63] = 0...
  • Page 924 Table 2-7. Branch Whether Hint bwh Completer Branch Whether Hint spnt Static Not-Taken sptk Static Taken dpnt Dynamic Not-Taken dptk Dynamic Taken Table 2-8. Sequential Prefetch Hint ph Completer Sequential Prefetch Hint few or none Few lines many Many lines Table 2-9.
  • Page 925 tmp_taken = PR[qp]; if (tmp_taken) { // tmp_growth indicates the amount to move logical TOP *up*: // tmp_growth = sizeof(previous out) - sizeof(current frame) // a negative amount indicates a shrinking stack tmp_growth = (AR[PFS].pfm.sof - AR[PFS].pfm.sol) - CFM.sof; alat_frame_update(-AR[PFS].pfm.sol, 0); rse_fatal = rse_restore_frame(AR[PFS].pfm.sol, tmp_growth, CFM.sof);...
  • Page 926 illegal_operation_fault(); tmp_taken = (AR[LC] != 0); if (AR[LC] != 0) AR[LC]--; break; case ‘ctop’: case ‘cexit’: // SW pipelined counted loop if (slot != 2) illegal_operation_fault(); if (btype == ‘ctop’) tmp_taken = ((AR[LC] != 0) || (AR[EC] u> 1)); if (btype == ‘cexit’)tmp_taken = !((AR[LC] != 0) || (AR[EC] u> 1)); if (AR[LC] != 0) { AR[LC]--;...
  • Page 927 taken_branch = 1; IP = tmp_IP; // set the new value for IP if (!impl_uia_fault_supported() && ((PSR.it && unimplemented_virtual_address(tmp_IP, PSR.vm)) || (!PSR.it && unimplemented_physical_address(tmp_IP)))) unimplemented_instruction_address_trap(lower_priv_transition, tmp_IP); if (lower_priv_transition && PSR.lp) lower_privilege_transfer_trap(); if (PSR.tb) taken_branch_trap(); Illegal Operation fault Lower-Privilege Transfer trap Interruptions: Disabled Instruction Set Transition fault Taken Branch trap...
  • Page 928 break break — Break ) break pseudo-op Format: ) break.i i_unit_form ) break.b b_unit_form ) break.m m_unit_form ) break.f f_unit_form ) break.x x_unit_form A Break Instruction fault is taken. For the i_unit_form, f_unit_form and m_unit_form, Description: the value specified by is zero-extended and placed in the Interruption Immediate control register (IIM).
  • Page 929 brl — Branch Long ) brl. Format: btype dh target ) brl. call_form btype dh b target brl. pseudo-op dh target A branch condition is evaluated, and either a branch is taken, or execution continues Description: with the next sequential instruction. The execution of a branch logically follows the execution of all previous non-branch instructions in the same instruction group.
  • Page 930 system is required to provide an Illegal Operation fault handler which emulates taken and not-taken long branches. Presence of this instruction is indicated by a 1 in the lb bit of CPUID register 4. See Section 3.1.11, “Processor Identification Registers” on page 1:34.
  • Page 931 brp — Branch Predict brp. ip_relative_form Format: ipwh ih target brp. indirect_form indwh ih b brp.ret. return_form, indirect_form indwh ih b This instruction can be used to provide to hardware early information about a future Description: branch. It has no effect on architectural machine state, and operates as a nop instruction except for its performance effects.
  • Page 932 Operation: tmp_tag = IP + sign_ext((timm << 4), 13); if (ip_relative_form) { tmp_target = IP + sign_ext((imm << 4), 25); tmp_wh = ipwh; } else { // indirect_form tmp_target = BR[b2]; tmp_wh = indwh; } branch_predict(tmp_wh, ih, return_form, tmp_target, tmp_tag); Interruptions: None. Volume 3: Instruction Reference 3:33...
  • Page 933 bsw — Bank Switch. Format: bsw.0 (zero_form); bsw.1 (one_form). Description: This instruction switches to the specified register bank. The zero_form specifies Bank 0 for GR16 to GR31. The one_form specifies Bank 1 for GR16 to GR31. After the bank switch the previous register bank is no longer accessible but does retain its current state....
  • Page 934 chk — Speculation Check ) chk.s pseudo-op Format: target ) chk.s.i control_form, i_unit_form, gr_form target ) chk.s.m control_form, m_unit_form, gr_form target ) chk.s control_form, fr_form target ) chk.a. data_form, gr_form aclr r target ) chk.a. data_form, fr_form aclr f target The result of a control- or data-speculative calculation is checked for success or failure.
  • Page 935 Operation: if (PR[qp]) { if (control_form) { if (fr_form && (tmp_isrcode = fp_reg_disabled(f , 0, 0, 0))) disabled_fp_register_fault(tmp_isrcode, 0); check_type = gr_form ? CHKS_GENERAL : CHKS_FLOAT; fail = (gr_form && GR[r ].nat) || (fr_form && FR[f ] == NATVAL); } else { // data_form if (gr_form) { reg_type...
  • Page 936 clrrrb — Clear RRB. Format: clrrrb (all_form); clrrrb.pr (pred_form). Description: In the all_form, the register rename base registers (CFM.rrb.gr, CFM.rrb.fr, and CFM.rrb.pr) are cleared. In the pred_form, the single register rename base register for the predicates (CFM.rrb.pr) is cleared. This instruction must be the last instruction in an instruction group;...
  • Page 937 clz — Count Leading Zeros ) clz Format: The number of leading zeros in GR is placed in GR Description: An Illegal Operation fault is raised on processor models that do not support the instruction. CPUID register 4 indicates the presence of the feature on the processor model.
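A count-leading-zeros operation on a 64-bit value can be modeled as follows. This is a sketch rather than the manual's own pseudocode: the name `clz64` is hypothetical, and returning 64 for a zero input is an assumption of this example, since the excerpt above does not state that case.

```python
MASK64 = (1 << 64) - 1


def clz64(value: int) -> int:
    """Count the zero bits above the most significant set bit of a 64-bit value."""
    value &= MASK64
    # int.bit_length() is the position of the highest set bit plus one,
    # so the leading-zero count is the remaining width.
    return 64 - value.bit_length()


if __name__ == "__main__":
    print(clz64(1), clz64(1 << 63), clz64(0x00FF_0000_0000_0000))
```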
  • Page 938 cmp — Compare ) cmp. register_form Format: crel ctype p ) cmp. imm8_form crel ctype p ) cmp. = r0, parallel_inequality_form crel ctype p ) cmp. , r0 pseudo-op crel ctype p The two source operands are compared for one of ten relations specified by crel. This Description: produces a boolean result which is 1 if the comparison condition is true, and 0 otherwise.
  • Page 939 simply uses the negative relation with an implemented type. The implemented relations and how the pseudo-ops map onto them are shown in Table 2-16 (for normal and unc type compares), and Table 2-17 (for parallel type compares). Table 2-16. 64-bit Comparison Relations for Normal and unc Compares Compare Relation Register Form is a Immediate Form is a...
  • Page 940 Operation: if (PR[qp]) { if (p == p illegal_operation_fault(); tmp_nat = (register_form ? GR[r ].nat : 0) || GR[r ].nat; if (register_form) tmp_src = GR[r else if (imm8_form) tmp_src = sign_ext(imm , 8); else // parallel_inequality_form tmp_src = 0; (crel == ‘eq’) tmp_rel = tmp_src == GR[r else if (crel == ‘ne’) tmp_rel = tmp_src != GR[r...
  • Page 941 illegal_operation_fault(); PR[p ] = 0; PR[p ] = 0; Illegal Operation fault Interruptions: 3:42 Volume 3: Instruction Reference...
  • Page 942 cmp4 cmp4 — Compare 4 Bytes ) cmp4. register_form Format: crel ctype p ) cmp4. imm8_form crel ctype p ) cmp4. = r0, parallel_inequality_form crel ctype p ) cmp4. , r0 pseudo-op crel ctype p The least significant 32 bits from each of two source operands are compared for one of Description: ten relations specified by crel.
  • Page 943 cmp4 Operation: if (PR[qp]) { if (p == p illegal_operation_fault(); tmp_nat = (register_form ? GR[r ].nat : 0) || GR[r ].nat; if (register_form) tmp_src = GR[r else if (imm8_form) tmp_src = sign_ext(imm , 8); else // parallel_inequality_form tmp_src = 0; (crel == ‘eq’) tmp_rel = tmp_src{31:0} == GR[r ]{31:0};...
  • Page 944 cmp4 PR[p ] = 0; break; case ‘unc’: // unc-type compare default: // normal compare if (tmp_nat) { PR[p ] = 0; PR[p ] = 0; } else { PR[p ] = tmp_rel; PR[p ] = !tmp_rel; break; } else { if (ctype == ‘unc’) { if (p == p...
  • Page 945 cmpxchg cmpxchg — Compare and Exchange ) cmpxchg , ar.ccv Format: ldhint r ) cmp8xchg16. , ar.csd, ar.ccv sixteen_byte_form ldhint r A value consisting of sz bytes (8 bytes for cmp8xchg16) is read from memory starting at Description: the address specified by the value in GR .
  • Page 946 cmpxchg affect program functionality and may be ignored by the implementation. See Section 4.4.6, “Memory Hierarchy Control and Consistency” on page 1:69 for details. For cmp8xchg16, Illegal Operation fault is raised on processor models that do not support the instruction. CPUID register 4 indicates the presence of the feature on the processor model.
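The general shape of a compare-and-exchange can be sketched against a byte array standing in for memory. Several things here are assumptions for illustration: the function name, the little-endian byte order, and the comparison against a caller-supplied `ccv` value (standing in for ar.ccv). Atomicity, the acquire/release semantics, and the ldhint completer are not modeled.

```python
def cmpxchg(memory: bytearray, addr: int, size: int,
            new_value: int, ccv: int) -> int:
    """Model of a compare-and-exchange on `size` bytes at `addr`.

    The old memory value (zero-extended) is always returned; the store of
    new_value happens only when the old value equals ccv.
    """
    old = int.from_bytes(memory[addr:addr + size], "little")
    if old == ccv:
        truncated = new_value & ((1 << (8 * size)) - 1)
        memory[addr:addr + size] = truncated.to_bytes(size, "little")
    return old


if __name__ == "__main__":
    mem = bytearray(16)
    mem[0:4] = (7).to_bytes(4, "little")
    print(cmpxchg(mem, 0, 4, new_value=99, ccv=7))   # matches, so 99 is stored
    print(cmpxchg(mem, 0, 4, new_value=123, ccv=7))  # no match, memory unchanged
    print(int.from_bytes(mem[0:4], "little"))
```

A caller detects success by comparing the returned old value with the value it supplied for comparison, which is the usual basis for lock-free retry loops.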
  • Page 947 cover — Cover Stack Frame. Format: cover. Description: A new stack frame of zero size is allocated which does not include any registers from the previous frame (as though all output registers in the previous frame had been locals). The register rename base registers are reset. If interruption collection is disabled (PSR.ic is zero), then the old value of the Current Frame Marker (CFM) is copied to the Interruption Function State register (IFS), and IFS.v is set to one....
  • Page 948 czx — Compute Zero Index ) czx1.l one_byte_form, left_form Format: ) czx1.r one_byte_form, right_form ) czx2.l two_byte_form, left_form ) czx2.r two_byte_form, right_form is scanned for a zero element. The element is either an 8-bit aligned byte Description: (one_byte_form) or a 16-bit aligned pair of bytes (two_byte_form). The index of the first zero element is placed in GR .
  • Page 949 else if ((GR[r3] & 0x0000ffff00000000) == 0) GR[r1] = 2; else if ((GR[r3] & 0xffff000000000000) == 0) GR[r1] = 3; else GR[r1] = 4; GR[r1].nat = GR[r3].nat; Interruptions: Illegal Operation fault
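The mask-and-test chain above (indices 2 and 3, then 4 when no zero element is found) matches a scan of 16-bit elements starting from the least significant end. The sketch below generalizes that pattern. The function name is hypothetical, the cases for indices 0 and 1 are inferred by extension of the visible fragment, and support for 8-bit elements (returning 8 when nothing is found) is an assumption by analogy.

```python
def czx_right(value: int, elem_bits: int = 16) -> int:
    """Index of the first all-zero element, scanning from the low end.

    Returns the element count (4 for 16-bit elements) when no element is zero.
    """
    count = 64 // elem_bits
    elem_mask = (1 << elem_bits) - 1
    for index in range(count):
        if (value >> (index * elem_bits)) & elem_mask == 0:
            return index
    return count


if __name__ == "__main__":
    print(czx_right(0xFFFF_0000_FFFF_FFFF))  # zero element at bits 47:32
    print(czx_right(0xFFFF_FFFF_FFFF_FFFF))  # no zero element
```

This kind of scan is commonly used to find a string terminator within a 64-bit word loaded from memory.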
  • Page 950 dep — Deposit ) dep merge_form, register_form Format: ) dep merge_form, imm_form , pos ) dep.z zero_form, register_form ) dep.z zero_form, imm_form In the merge_form, a right justified bit field taken from the first source operand is Description: deposited into the value in GR r at an arbitrary bit position and the result is placed in GR r .
  • Page 951 Operation: if (PR[qp]) { check_target_register(r if (imm_form) { tmp_src = (merge_form ? sign_ext(imm ,1) : sign_ext(imm , 8)); tmp_nat = merge_form ? GR[r ].nat : 0; tmp_len = len } else { // register_form tmp_src = GR[r tmp_nat = (merge_form ? GR[r ].nat : 0) || GR[r ].nat;...
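The deposit operation can be modeled with plain shifts and masks. This sketch covers only the register forms: `dep` merges a right-justified field into an existing value, and `dep_z` deposits into zeros. The function names and argument order are chosen for the example, NaT propagation is omitted, and any part of the field that would fall above bit 63 is simply discarded, which is an assumption of this model.

```python
MASK64 = (1 << 64) - 1


def dep(src: int, target: int, pos: int, length: int) -> int:
    """Deposit the low `length` bits of src into target at bit position pos."""
    field_mask = (((1 << length) - 1) << pos) & MASK64
    return (target & ~field_mask & MASK64) | ((src << pos) & field_mask)


def dep_z(src: int, pos: int, length: int) -> int:
    """Zero form: deposit the field into an all-zero value."""
    return dep(src, 0, pos, length)


if __name__ == "__main__":
    print(hex(dep(0b101, 0xFFFF, pos=4, length=3)))
    print(hex(dep_z(0xAB, pos=8, length=4)))
```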
  • Page 952 epc — Enter Privileged Code Format: This instruction increases the privilege level. The new privilege level is given by the TLB Description: entry for the page containing this instruction. This instruction can be used to implement calls to higher-privileged routines without the overhead of an interruption. Before increasing the privilege level, a check is performed.
  • Page 953 extr extr — Extract ) extr signed_form Format: ) extr.u unsigned_form A field is extracted from GR , either zero extended or sign extended, and placed Description: right-justified in GR . The field begins at the bit position given by the second operand and extends bits to the left.
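Field extraction with zero or sign extension can be sketched as below. The function name and parameters are hypothetical. Clamping a field that would run past bit 63 is an assumption of this model, and results are returned as 64-bit unsigned integers so a sign-extended negative field appears with its upper bits set.

```python
MASK64 = (1 << 64) - 1


def extr(value: int, pos: int, length: int, signed: bool = True) -> int:
    """Extract `length` bits starting at bit `pos`, right-justified."""
    length = min(length, 64 - pos)          # keep the field inside 64 bits
    field = (value >> pos) & ((1 << length) - 1)
    if signed and (field >> (length - 1)) & 1:
        field -= 1 << length                # sign-extend from the field's top bit
    return field & MASK64


if __name__ == "__main__":
    print(hex(extr(0xF0, pos=4, length=4, signed=False)))
    print(hex(extr(0xF0, pos=4, length=4, signed=True)))
```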
  • Page 954 fabs — Floating-point Absolute Value. Format: (qp) fabs f1 = f3, pseudo-op of: (qp) fmerge.s f1 = f0, f3. Description: The absolute value of the value in FR f3 is computed and placed in FR f1. If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result. Operation: See “fmerge —...
  • Page 955 fadd — Floating-point Add. Format: (qp) fadd.pc.sf f1 = f3, f2, pseudo-op of: (qp) fma.pc.sf f1 = f3, f1, f2. Description: FR f3 and FR f2 are added (computed to infinite precision), rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1....
  • Page 956 famax famax — Floating-point Absolute Maximum ) famax. Format: sf f The operand with the larger absolute value is placed in FR . If the magnitude of FR Description: equals the magnitude of FR , FR gets FR If either FR or FR is a NaN, FR gets FR...
  • Page 957 famin famin — Floating-point Absolute Minimum ) famin. Format: sf f The operand with the smaller absolute value is placed in FR . If the magnitude of FR Description: equals the magnitude of FR , FR gets FR If either FR or FR is a NaN, FR gets FR...
  • Page 958 fand — Floating-point Logical And. Format: (qp) fand f1 = f2, f3. Description: The bit-wise logical AND of the significand fields of FR f2 and FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR...
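The result of the floating-point logical AND can be pictured as a (sign, exponent, significand) triple. In the sketch below the tuple layout and the names are illustrative only, and the positive sign (0) is an assumption, since the excerpt is cut off before stating the sign value. NaTVal handling is not modeled.

```python
MASK64 = (1 << 64) - 1
FAND_EXPONENT = 0x1003E  # biased exponent quoted in the description


def fand(sig_a: int, sig_b: int) -> tuple[int, int, int]:
    """Return (sign, exponent, significand) for a significand-wise AND."""
    return (0, FAND_EXPONENT, sig_a & sig_b & MASK64)


if __name__ == "__main__":
    sign, exponent, significand = fand(0xFFFF_0000_FFFF_0000, 0x0F0F_0F0F_0F0F_0F0F)
    print(sign, hex(exponent), hex(significand))
```

With a register-format exponent bias of 0xFFFF, the value 0x1003E corresponds to an unbiased exponent of 63, so the 64-bit significand reads as an integer.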
  • Page 959 fandcm fandcm — Floating-point And Complement ) fandcm Format: The bit-wise logical AND of the significand field of FR with the bit-wise complemented Description: significand field of FR is computed. The resulting value is stored in the significand field of FR .
  • Page 960 fc — Flush Cache ) fc invalidate_line_form Format: ) fc.i instruction_cache_coherent_form In the invalidate_line form, the cache line associated with the address specified by the Description: value of GR r is invalidated from all levels of the processor cache hierarchy. The invalidation is broadcast throughout the coherence domain.
  • Page 961 Interruptions: Register NaT Consumption fault; Data TLB fault; Unimplemented Data Address fault; Data Page Not Present fault; Data Nested TLB fault; Data NaT Page Consumption fault; Alternate Data TLB fault; Data Access Rights fault; VHPT Data fault.
  • Page 962 fchkf fchkf — Floating-point Check Flags ) fchkf. Format: sf target The flags in FPSR.sf.flags are compared with FPSR.s0.flags and FPSR.traps. If any flags Description: set in FPSR.sf.flags correspond to FPSR.traps which are enabled, or if any flags set in FPSR.sf.flags are not set in FPSR.s0.flags, then a branch to is taken.
  • Page 963 fclass fclass — Floating-point Class ) fclass. Format: fcrel fctype p fclass The contents of FR are classified according to the completer as shown in Description: fclass Table 2-25. This produces a boolean result based on whether the contents of FR agrees with the floating-point number format specified by , as specified by the fclass...
  • Page 964 fclass Operation: if (PR[qp]) { if (p == p illegal_operation_fault(); if (tmp_isrcode = fp_reg_disabled(f , 0, 0, 0)) disabled_fp_register_fault(tmp_isrcode, 0); tmp_rel = ((fclass {0} && !FR[f ].sign || fclass {1} && FR[f ].sign) && ((fclass {2} && fp_is_zero(FR[f ]))|| (fclass {3} &&...
  • Page 965 fclrf — Floating-point Clear Flags. Format: (qp) fclrf.sf. Description: The status field’s 6-bit flags field is reset to zero. The mnemonic values for sf are given in Table 2-23 on page 3:56. Operation: if (PR[qp]) { fp_set_sf_flags(sf, 0); } FP Exceptions: None. Interruptions: None...
  • Page 966 fcmp fcmp — Floating-point Compare ) fcmp. Format: frel fctype sf p The two source operands are compared for one of twelve relations specified by frel. This Description: produces a boolean result which is 1 if the comparison condition is true, and 0 otherwise.
  • Page 967 fcmp Operation: if (PR[qp]) { if (p == p illegal_operation_fault(); if (tmp_isrcode = fp_reg_disabled(f , 0, 0)) disabled_fp_register_fault(tmp_isrcode, 0); if (fp_is_natval(FR[f ]) || fp_is_natval(FR[f ])) { PR[p ] = 0; PR[p ] = 0; } else { fcmp_exception_fault_check(f , frel, sf, &tmp_fp_env); if (fp_raise_fault(tmp_fp_env)) fp_exception_fault(fp_decode_fault(tmp_fp_env));...
  • Page 968 fcmp. FP Exceptions: Invalid Operation (V); Denormal/Unnormal Operand (D); Software Assist (SWA) fault. Interruptions: Illegal Operation fault; Floating-point Exception fault; Disabled Floating-point Register fault.
  • Page 969 fcvt.fx fcvt.fx — Convert Floating-point to Integer ) fcvt.fx. signed_form Format: sf f ) fcvt.fx.trunc. signed_form, trunc_form sf f ) fcvt.fxu. unsigned_form sf f ) fcvt.fxu.trunc. unsigned_form, trunc_form sf f is treated as a register format floating-point value and converted to a signed Description: (signed_form) or unsigned integer (unsigned_form) using either the rounding mode specified in the FPSR.sf.rc, or using Round-to-Zero if the trunc_form of the instruction is...
  • Page 970 fcvt.fx. FP Exceptions: Invalid Operation (V); Inexact (I); Denormal/Unnormal Operand (D); Software Assist (SWA) fault. Interruptions: Illegal Operation fault; Floating-point Exception fault; Disabled Floating-point Register fault; Floating-point Exception trap.
  • Page 971 fcvt.xf fcvt.xf — Convert Signed Integer to Floating-point ) fcvt.xf Format: The 64-bit significand of FR is treated as a signed integer and its register file precision Description: floating-point representation is placed in FR If FR is a NaTVal, FR is set to NaTVal instead of the computed result.
  • Page 972 fcvt.xuf fcvt.xuf — Convert Unsigned Integer to Floating-point ) fcvt.xuf.pc.sf pseudo-op of: ( ) fma. , f1, f0 Format: sf f is multiplied with FR 1, rounded to the precision indicated by pc (and possibly Description: FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR Note: Multiplying FR with FR 1 (a 1.0) normalizes the canonical representation of an...
  • Page 973 fetchadd fetchadd — Fetch and Add Immediate ) fetchadd4. four_byte_form Format: ldhint r ) fetchadd8. eight_byte_form ldhint r A value consisting of four or eight bytes is read from memory starting at the address Description: specified by the value in GR .
  • Page 974 fetchadd Operation: if (PR[qp]) { check_target_register(r if (GR[r ].nat) register_nat_consumption_fault(SEMAPHORE); size = four_byte_form ? 4 : 8; paddr = tlb_translate(GR[r ], size, SEMAPHORE, PSR.cpl, &mattr, &tmp_unused); if (!ma_supports_fetchadd(mattr)) unsupported_data_reference_fault(SEMAPHORE, GR[r if (sem == ‘acq’) val = mem_xchg_add(inc , paddr, size, UM.be, mattr, ACQUIRE, ldhint); else // ‘rel’...
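The read-add-write behavior of fetch-and-add can be sketched against a byte array standing in for memory. The function name and the little-endian byte order are assumptions of this example, and the operation is not atomic here. The acquire/release ordering, the translation step, and the fault checks in the pseudocode above are all omitted.

```python
def fetchadd(memory: bytearray, addr: int, size: int, inc: int) -> int:
    """Add inc to the size-byte value at addr and return the value read."""
    old = int.from_bytes(memory[addr:addr + size], "little")
    new = (old + inc) % (1 << (8 * size))   # wrap within the operand size
    memory[addr:addr + size] = new.to_bytes(size, "little")
    return old


if __name__ == "__main__":
    mem = bytearray(8)
    print(fetchadd(mem, 0, 8, 4))    # returns the previous value
    print(fetchadd(mem, 0, 8, -1))   # a negative increment decrements
    print(int.from_bytes(mem, "little"))
```

Because the previous value is returned, callers can use this pattern for ticket counters and reference counts without a separate load.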
  • Page 975 flushrs — Flush Register Stack. Format: flushrs. Description: All stacked general registers in the dirty partition of the register stack are written to the backing store before execution continues. The dirty partition contains registers from previous procedure frames that have not yet been saved to the backing store. For a description of the register stack partitions, refer to Chapter 6, “Register Stack Engine”...
  • Page 976 fma — Floating-point Multiply Add ) fma. Format: sf f The product of FR and FR is computed to infinite precision and then FR is added to Description: this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc.
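The key property described above, a product and sum carried out to infinite precision with a single final rounding, can be illustrated with exact rational arithmetic. This sketch rounds to IEEE double precision with round-to-nearest-even. It does not model the pc completer, FPSR status fields, or the 82-bit register format, and the function name is chosen for the example.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to double precision.

    Only finite inputs are supported in this sketch.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # one correctly rounded conversion


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    # The exact product is 1 - 2**-60, which a separate multiply rounds to 1.0.
    print(a * b - 1.0)       # two roundings lose the small term
    print(fma(a, b, -1.0))   # single rounding keeps it
```

The difference between the two printed values is the reason a fused multiply-add is preferred for dot products, polynomial evaluation, and software division sequences.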
  • Page 977 Interruptions: Illegal Operation fault; Floating-point Exception fault; Disabled Floating-point Register fault; Floating-point Exception trap.
  • Page 978 fmax fmax — Floating-point Maximum ) fmax. Format: sf f The operand with the larger value is placed in FR . If FR equals FR , FR gets FR Description: If either FR or FR is a NaN, FR gets FR If either FR or FR is a NaTVal, FR...
  • Page 979 fmerge fmerge — Floating-point Merge ) fmerge.ns neg_sign_form Format: ) fmerge.s sign_form ) fmerge.se sign_exp_form Sign, exponent and significand fields are extracted from FR and FR , combined, and Description: the result is placed in FR For the neg_sign_form, the sign of FR is negated and concatenated with the exponent and the significand of FR .
  • Page 980 fmerge Operation: if (PR[qp]) { fp_check_target_register(f if (tmp_isrcode = fp_reg_disabled(f , 0)) disabled_fp_register_fault(tmp_isrcode, 0); if (fp_is_natval(FR[f ]) || fp_is_natval(FR[f ])) { FR[f ] = NATVAL; } else { FR[f ].significand = FR[f ].significand; if (neg_sign_form) { FR[f ].exponent = FR[f ].exponent;...
  • Page 981 fmin fmin — Floating-point Minimum ) fmin. Format: sf f The operand with the smaller value is placed in FR . If FR equals FR , FR gets FR Description: If either FR or FR is a NaN, FR gets FR If either FR or FR is a NaTVal, FR...
  • Page 982 fmix fmix — Floating-point Mix ) fmix.l mix_l_form Format: ) fmix.r mix_r_form ) fmix.lr mix_lr_form For the mix_l_form (mix_r_form), the left (right) single precision value in FR Description: concatenated with the left (right) single precision value in FR . For the mix_lr_form, the left single precision value in FR is concatenated with the right single precision value in FR...
  • Page 983 fmix Operation: if (PR[qp]) { fp_check_target_register(f if (tmp_isrcode = fp_reg_disabled(f , 0)) disabled_fp_register_fault(tmp_isrcode, 0); if (fp_is_natval(FR[f ]) || fp_is_natval(FR[f ])) { FR[f ] = NATVAL; } else { if (mix_l_form) { tmp_res_hi = FR[f ].significand{63:32}; tmp_res_lo = FR[f ].significand{63:32}; } else if (mix_r_form) { tmp_res_hi = FR[f ].significand{31:0};...
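The three mix forms select which 32-bit half of each source significand ends up in the result. The sketch below models only that selection on 64-bit integers. The function name and the `form` strings are hypothetical, "first" and "second" refer to the two source registers in the order given in the description, and NaTVal handling plus the exponent and sign fields are not modeled.

```python
MASK32 = 0xFFFF_FFFF


def fmix(first_sig: int, second_sig: int, form: str) -> int:
    """Combine halves of two 64-bit significands according to the mix form."""
    first_hi, first_lo = (first_sig >> 32) & MASK32, first_sig & MASK32
    second_hi, second_lo = (second_sig >> 32) & MASK32, second_sig & MASK32
    if form == "l":       # left half of each source
        hi, lo = first_hi, second_hi
    elif form == "r":     # right half of each source
        hi, lo = first_lo, second_lo
    elif form == "lr":    # left of the first, right of the second
        hi, lo = first_hi, second_lo
    else:
        raise ValueError("form must be 'l', 'r', or 'lr'")
    return (hi << 32) | lo


if __name__ == "__main__":
    a, b = 0xAAAAAAAA_BBBBBBBB, 0xCCCCCCCC_DDDDDDDD
    for form in ("l", "r", "lr"):
        print(form, hex(fmix(a, b, form)))
```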
  • Page 984 fmpy — Floating-point Multiply. Format: (qp) fmpy.pc.sf f1 = f3, f4, pseudo-op of: (qp) fma.pc.sf f1 = f3, f4, f0. Description: The product of FR f3 and FR f4 is computed to infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc....
  • Page 985 fms — Floating-point Multiply Subtract ) fms. Format: sf f The product of FR and FR is computed to infinite precision and then FR Description: subtracted from this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc.
  • Page 986 Interruptions: Illegal Operation fault; Floating-point Exception fault; Disabled Floating-point Register fault; Floating-point Exception trap.
  • Page 987 fneg — Floating-point Negate. Format: (qp) fneg f1 = f3, pseudo-op of: (qp) fmerge.ns f1 = f3, f3. Description: The value in FR f3 is negated and placed in FR f1. If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result. Operation: See “fmerge —...
  • Page 988 fnegabs — Floating-point Negate Absolute Value. Format: (qp) fnegabs f1 = f3, pseudo-op of: (qp) fmerge.ns f1 = f0, f3. Description: The absolute value of the value in FR f3 is computed, negated, and placed in FR f1. If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result. Operation: See “fmerge —...
  • Page 989 fnma fnma — Floating-point Negative Multiply Add ) fnma. Format: sf f The product of FR and FR is computed to infinite precision, negated, and then FR Description: is added to this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc.
  • Page 990 fnma. Interruptions: Illegal Operation fault; Floating-point Exception fault; Disabled Floating-point Register fault; Floating-point Exception trap.
  • Page 991 fnmpy — Floating-point Negative Multiply. Format: (qp) fnmpy.pc.sf f1 = f3, f4, pseudo-op of: (qp) fnma.pc.sf f1 = f3, f4, f0. Description: The product of FR f3 and FR f4 is computed to infinite precision and then negated. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc....
  • Page 992 fnorm fnorm — Floating-point Normalize ) fnorm. pseudo-op of: ( ) fma. , f1, f0 Format: sf f sf f is normalized and rounded to the precision indicated by pc (and possibly Description: FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR If FR is a NaTVal, FR...
  • Page 993 for — Floating-point Logical Or. Format: (qp) for f1 = f2, f3. Description: The bit-wise logical OR of the significand fields of FR f2 and FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR...
  • Page 994 fpabs fpabs — Floating-point Parallel Absolute Value ) fpabs pseudo-op of: ( ) fpmerge.s = f0, Format: The absolute values of the pair of single precision values in the significand field of FR Description: are computed and stored in the significand field of FR .
  • Page 995 fpack fpack — Floating-point Pack ) fpack pack_form Format: The register format numbers in FR and FR are converted to single precision memory Description: format. These two single precision numbers are concatenated and stored in the significand field of FR .
  • Page 996 fpamax fpamax — Floating-point Parallel Absolute Maximum ) fpamax. Format: sf f The paired single precision values in the significands of FR and FR are compared. Description: The operands with the larger absolute value are returned in the significand field of FR If the magnitude of high (low) FR is less than the magnitude of high (low) FR , high...
  • Page 997 fpamax. FP Exceptions: Invalid Operation (V); Denormal/Unnormal Operand (D); Software Assist (SWA) fault. Interruptions: Illegal Operation fault; Floating-point Exception fault; Disabled Floating-point Register fault.
  • Page 998 fpamin — Floating-point Parallel Absolute Minimum. Format: (qp) fpamin.sf f1 = f2, f3. Description: The paired single precision values in the significands of FR f2 and FR f3 are compared. The operands with the smaller absolute value are returned in the significand of FR f1. If the magnitude of high (low) FR is less than the magnitude of high (low) FR , high...
  • Page 999 fpamin. FP Exceptions: Invalid Operation (V); Denormal/Unnormal Operand (D); Software Assist (SWA) fault. Interruptions: Illegal Operation fault; Floating-point Exception fault; Disabled Floating-point Register fault.
  • Page 1000 fpcmp fpcmp — Floating-point Parallel Compare ) fpcmp. Format: frel sf f The two pairs of single precision source operands in the significand fields of FR and FR Description: are compared for one of twelve relations specified by frel. This produces a boolean result which is a mask of 32 1’s if the comparison condition is true, and a mask of 32 0’s otherwise.
  • Page 1001 fpcmp Operation: if (PR[qp]) { fp_check_target_register(f if (tmp_isrcode = fp_reg_disabled(f , 0)) disabled_fp_register_fault(tmp_isrcode, 0); if (fp_is_natval(FR[f ]) || fp_is_natval(FR[f ])) { FR[f ] = NATVAL; } else { fpcmp_exception_fault_check(f , frel, sf, &tmp_fp_env); if (fp_raise_fault(tmp_fp_env)) fp_exception_fault(fp_decode_fault(tmp_fp_env)); tmp_fr2 = fp_reg_read_hi(f tmp_fr3 = fp_reg_read_hi(f (frel == ‘eq’) tmp_rel = fp_equal(tmp_fr2, tmp_fr3);...
  • Page 1002 fpcmp tmp_res_lo = (tmp_rel ? 0xFFFFFFFF : 0x00000000); FR[f ].significand = fp_concatenate(tmp_res_hi, tmp_res_lo); FR[f ].exponent = FP_INTEGER_EXP; FR[f ].sign = FP_SIGN_POSITIVE; fp_update_fpsr(sf, tmp_fp_env); fp_update_psr(f Invalid Operation (V) FP Exceptions: Denormal/Unnormal Operand (D) Software Assist (SWA) fault Illegal Operation fault Floating-point Exception fault Interruptions: Disabled Floating-point Register fault Volume 3: Instruction Reference...
  • Page 1003 fpcvt.fx fpcvt.fx — Convert Parallel Floating-point to Integer ) fpcvt.fx. signed_form Format: sf f ) fpcvt.fx.trunc. signed_form, trunc_form sf f ) fpcvt.fxu. unsigned_form sf f ) fpcvt.fxu.trunc. unsigned_form, trunc_form sf f The pair of single precision values in the significand field of FR is converted to a pair Description: of 32-bit signed integers (signed_form) or unsigned integers (unsigned_form) using...

This manual is also suitable for:

Itanium architecture 2.3
