Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.
Part 1: Application Architecture Guide ...... 1:3
1.1.2 Part 2: Optimization Guide for the Intel® Itanium® Architecture .. 1:3
Overview of Volume 2: System Architecture.
Additions beyond the IEEE Standard ...... 1:107
IA-32 Application Execution Model in an Intel® Itanium® System Environment .. 1:109
IA-32 Execution Layer .
Software Pipelining ......... 1:183
Loop Support Features in the Intel® Itanium® Architecture .... 1:184
5.4.1...
IA-32 application interface. This volume also describes optimization techniques used to generate high performance software.
1.1.1 Part 1: Application Architecture Guide
Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. Intel...
1.2.1 Part 1: System Architecture Guide
Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual. ...
Chapter 9, “IA-32 Interruption Vector Descriptions” lists IA-32 exceptions, interrupts and intercepts that can occur during IA-32 instruction set execution in the Itanium System Environment.
Chapter 10, “Itanium® Architecture-based Operating System Interaction Model with IA-32 Applications” defines the operation of IA-32 instructions within the Itanium System Environment from the perspective of an Itanium architecture-based operating system.
Instruction Set Reference
This volume is a comprehensive reference to the Itanium instruction set, including instruction format/encoding.
Chapter 1, “About this Manual” provides an overview of all volumes in the Intel® Itanium® Architecture Software Developer’s Manual.
Chapter 2, “Instruction Reference”...
These resources include instructions and registers.
Itanium Architecture – The new ISA with 64-bit instruction capabilities, new performance-enhancing features, and support for the IA-32 instruction set.
IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
• Intel® 64 and IA-32 Architectures Software Developer’s Manual – This set of manuals describes the Intel 32-bit architecture. They are available from the Intel Literature Department by calling 1-800-548-4725 and requesting Document Numbers 243190, 243191 and 243192. ...
Date of Revision / Revision Number / Description
August 2005: Allow register fields in the CR.LID register to be read-only and make CR.LID checking on interruption messages by processors optional. See Vol 2, Part I, Ch 5 “Interruptions” and Section 11.2.2 “PALE_RESET Exit State” for details. Relaxed checking of reserved and ignored fields in IA-32 application registers in Vol 1 Ch 6 and Vol 2, Part I, Ch 10.
August 2002 Added Predicate Behavior of alloc Instruction Clarification (Section 4.1.2, Part I, Volume 1; Section 2.2, Part I, Volume 3). Added New fc.i Instruction (Section 4.4.6.1, and 4.4.6.2, Part I, Volume 1; Section 4.3.3, 4.4.1, 4.4.5, 4.4.6, 4.4.7, 5.5.2, and 7.1.2, Part I, Volume 2; Section 2.5, 2.5.1, 2.5.2, 2.5.3, and 4.5.2.1, Part II, Volume 2;...
Volume 2: Class pr-writers-int clarification (Table A-5). PAL_MC_DRAIN clarification (Section 4.4.6.1). VHPT walk and forward progress change (Section 4.1.1.2). IA-32 IBR/DBR match clarification (Section 7.1.1). ISR figure changes (pp. 8-5, 8-26, 8-33 and 8-36). PAL_CACHE_FLUSH return argument change –...
Volume 2: Clarifications regarding “reserved” fields in ITIR (Chapter 3). Instruction and Data translation must be enabled for executing IA-32 instructions (Chapters 3,4 and 10). FCR/FDR mappings, and clarification to the value of PSR.ri after an RFI (Chapters 3 and 4).
Operating Environments
The architectural model supports a mixture of IA-32 and Itanium architecture-based applications within a single Itanium architecture-based operating system. Table 2-1 defines the major supported operating environments.
Volume 1, Part 1: Introduction to the Intel® Itanium® Architecture 1:13
Table 2-1. Major Operating Environments
System Environment | Application Environment | Usage
Itanium System Environment | IA-32 Protected Mode | IA-32 Protected Mode applications in the Intel® Itanium® System Environment.
Itanium System Environment | IA-32 Real Mode | IA-32 Real Mode applications in the Intel® Itanium® System Environment.
...
(see “Speculation” on page 1:16). In traditional architectures, procedure calls limit performance since registers need to be spilled and...
If the new control speculative load causes an exception, then the exception should only be serviced if (a>b) is true. When...
To illustrate, an unpredicated instruction r1 = r2 + r3, when predicated, would be of the form...
The hardware can exploit the explicit register stack frame information to spill and fill registers from the register stack to memory at the best opportunity (independent of the calling and called procedures).
128 floating-point registers are defined. Of these, 96 registers are rotating (not stacked) and can be used to modulo schedule loops compactly. Multiple floating-point status registers are provided for speculation.
They are useful for creating high performance compression/decompression algorithms that are used by applications which have sound and video. Itanium multimedia instructions are semantically compatible with HP’s MAX-2* multimedia technology and Intel’s MMX and SSE technology instructions.
The following terms are used in the remainder of this document:
• Itanium Instruction Set – The Itanium architecture defines the 64-bit instruction set extensions to the IA-32 architecture.
• IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
Execution Environment
The architectural state consists of registers and memory. The results of instruction execution become architecturally visible according to a set of execution sequencing rules. This chapter describes the application architectural state and the rules for execution sequencing. See Chapter 6 for details on IA-32 instruction set execution.
ignore the value written. In variable-sized register sets, registers which are unimplemented in a particular processor are also reserved registers. An access to one of these unimplemented registers causes a Reserved Register/Field fault. Within defined registers, fields which are not defined are either reserved or ignored. For reserved fields, hardware will always return a zero on a read.
General registers 8 through 31 contain the IA-32 integer, segment selector and segment descriptor registers. See “IA-32 General Purpose Registers” on page 1:117 for details on IA-32 register assignments.
3.1.3 Floating-point Registers
A set of 128 (82-bit) floating-point registers is used for all floating-point computation.
3.1.6 Instruction Pointer
The Instruction Pointer (IP) holds the address of the bundle which contains the currently executing instruction. The IP can be read directly with a mov ip instruction. The IP cannot be directly written, but is incremented as instructions are executed, and can be set to a new value with a branch.
3.1.8 Application Registers
The application register file includes special-purpose data registers and control registers for application-visible processor functions for both the IA-32 and Itanium instruction set architectures. These registers can be accessed by Itanium architecture-based applications (except where noted). Table 3-3 contains a list of the application registers.
Application registers can only be accessed by either an M or I execution unit. This is specified in the last column of the table. The ignored registers are for future backward-compatible extensions. See Section 10.2, “System Register Model” on page 2:239 for the field definition of each IA-32 application register.
Figure 3-4. BSP Register Format (field shown: pointer)
3.1.8.4 RSE Backing Store Pointer for Memory Stores (BSPSTORE – AR 18)
The RSE Backing Store Pointer for memory stores is a 64-bit register (Figure 3-5). It holds the address of the location in memory to which the RSE will spill the next value. Section 6.1, “RSE and Backing Store Overview”...
3.1.8.8 User NaT Collection Register (UNAT – AR 36)
The User NaT Collection Register is a 64-bit register used to temporarily hold NaT bits when saving and restoring general registers with the ld8.fill and st8.spill instructions.
3.1.8.9 Floating-point Status Register (FPSR – AR 40)
The floating-point status register (FPSR) controls traps, rounding mode, precision control, flags, and other control bits for Itanium floating-point instructions.
System software can secure the resource utilization counter from non-privileged access. When secured, a read of the RUC at any privilege level other than the most privileged causes a Privileged Register fault. The RUC for a logical processor does not count when that logical processor is in LIGHT_HALT, unless all logical processors on a given physical processor are in LIGHT_HALT, in which case the last logical processor on a given physical processor to enter LIGHT_HALT has its RUC continue to count.
3.1.8.13 Loop Count Register (LC – AR 65)
The Loop Count register (LC) is a 64-bit register used in counted loops. LC is decremented by counted-loop-type branches.
3.1.8.14 Epilog Count Register (EC – AR 66)
The Epilog Count register (EC) is a 6-bit register used for counting the final (epilog) stages in modulo-scheduled loops.
0: unaligned data memory references may cause an Unaligned Data Reference fault.
1: all unaligned data memory references cause an Unaligned Data Reference fault.
Lower (f2..f31) floating-point registers written – This bit is set to one when an Intel® Itanium® instruction that uses register f2..f31 as a target register completes.
Table 3-7. CPUID Register 3 Fields Field Bits Description number The index of the largest implemented CPUID register (one less than the number of implemented CPUID registers). This value will be at least 4. revision 15:8 Processor revision number. An 8-bit value that represents the revision or stepping of this processor implementation within the processor model.
Table 3-8. CPUID Register 4 Fields (Continued) Field Bits Description Processor implements mpy4 and mpyshl4 instructions (see “tf — Test Feature” instruction in Volume 63:34 Reserved. Memory This section describes an Itanium architecture-based application program’s view of memory. This includes a description of how memory is accessed, for both 32-bit and 64-bit applications.
larger-than-byte loads and stores are big endian (lower-addressed bytes in memory correspond to the higher-order bytes in the register). Load byte and store byte are not affected by the UM.be bit. The UM.be bit does not affect instruction fetch, IA-32 references, or the RSE.
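The effect of the UM.be bit on a multi-byte load can be illustrated with a short Python sketch. The function name `load`, the `be` parameter standing in for UM.be, and the sample bytes are illustrative assumptions, not part of the architecture definition.

```python
def load(memory: bytes, addr: int, size: int, be: bool) -> int:
    """Model a size-byte load; `be` stands in for the UM.be bit (assumption)."""
    data = memory[addr:addr + size]
    # Big endian: lower-addressed bytes become the higher-order register bytes.
    return int.from_bytes(data, "big" if be else "little")

mem = bytes([0x11, 0x22, 0x33, 0x44])
little = load(mem, 0, 4, be=False)   # lowest address ends up in the low byte
big = load(mem, 0, 4, be=True)       # lowest address ends up in the high byte
one_byte = load(mem, 1, 1, be=True)  # single-byte loads do not depend on `be`
```

A one-byte load returns the same value for either setting, which matches the statement that load byte and store byte are not affected by UM.be.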
Instruction Encoding Overview
Each instruction is categorized into one of six types; each instruction type may be executed on one or more execution unit types. Table 3-9 lists the instruction types and the execution unit type on which they are executed. Table 3-9.
4. Update architectural state, if necessary (update). An instruction group is a sequence of instructions starting at a given bundle address and slot number and including all instructions at sequentially increasing slot numbers and bundle addresses up to the first stop, taken branch, Break Instruction fault due to a break.b, or Illegal Operation fault due to a Reserved or Reserved if PR[qp] is one encoding in the B-type opcode space.
The ordering rules above form the context for register dependency restrictions, memory dependency restrictions and the order of exception reporting. These dependency restrictions apply only between instructions whose resource reads and writes are not dynamically disabled by predication. • Register dependencies: Within an instruction group, read-after-write (RAW) and write-after-write (WAW) register dependencies are not allowed (except as noted in “RAW Dependency Special Cases”...
The ordering rules and the dependency restrictions allow the processor to dynamically re-order instructions, execute instructions with non-unit latency, or even concurrently execute instructions on opposing sides of a stop or taken branch, provided that correct sequencing is enforced and the appearance of sequential execution is presented to the programmer.
br.ia work like other instructions for the purposes of register dependency; i.e., if their qualifying predicate is 0, they are not considered readers or writers of other resources. Branches br.cloop, br.cexit, br.ctop, br.wexit, and br.wtop are exceptional in that they are always readers or writers of their resources, regardless of the value of their qualifying predicate.
3.4.3 WAR Dependency Special Cases
The WAR dependency between the reading of predicate register 63 by any B-type instruction and the subsequent writing of predicate register 63 by a modulo-scheduled loop type branch (br.ctop, br.cexit, br.wtop, or br.wexit) without an intervening stop is not allowed.
• RAW and WAW register dependencies within the same instruction group are disallowed except as noted in Section 3.4, “Instruction Sequencing Considerations” on page 1:39. Their behavior within an instruction group is undefined. Undefined behavior includes the possibility of an Illegal Operation fault. •...
1:46 Volume 1, Part 1: Execution Environment...
64 bits before use. The floating-point programming model is described separately in Chapter 5, “Floating-point Programming Model” in Volume 1. Refer to Volume 3: Intel® Itanium® Instruction Set Reference for detailed information on Itanium instructions. The main features of the programming model covered here are: •...
The local and output areas of a frame can be re-sized using the alloc instruction which specifies immediates that determine the size of frame (sof) and size of locals (sol). Note: In the assembly language, alloc uses three immediate operands to determine the values of sol and sof: the size of inputs;...
Figure 4-1. Register Stack Behavior on Procedure Call and Return [figure: stacked GRs and frame markers (sol, sof) for the caller’s frame (procA) and the callee’s frame (procB), shown before the call, after the call, after alloc, and after return]
The flushrs instruction is used to force all previous stack frames out to backing store memory. It stalls instruction execution until all active frames in the physical register stack up to, but not including the current frame are spilled to the backing store by the RSE.
4.2.1 Arithmetic Instructions
Addition and subtraction (add, sub) are supported with regular two input forms and special three input forms. The three input addition form adds one to the sum of two input registers. The three input subtraction form subtracts one from the difference of two input registers.
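The three input forms can be sketched in Python as follows. This is a minimal model: the function names are hypothetical, and it assumes results wrap modulo 2^64 because the general registers hold 64 bits of data.

```python
MASK64 = (1 << 64) - 1  # 64-bit register width (wrap-around is an assumption)

def add_plus_one(r2: int, r3: int) -> int:
    """Three input addition form: one is added to the sum of two registers."""
    return (r2 + r3 + 1) & MASK64

def sub_minus_one(r2: int, r3: int) -> int:
    """Three input subtraction form: one is subtracted from the difference."""
    return (r2 - r3 - 1) & MASK64
```

For example, `add_plus_one(1, 2)` yields 4 and `sub_minus_one(5, 3)` yields 1.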
Table 4-4. Integer Logical Instructions
Mnemonic | Operation
and | Logical and
or | Logical or
andcm | Logical and complement
xor | Logical exclusive or
4.2.3 32-bit Addresses and Integers
Support for 32-bit addresses is provided in the form of add instructions that perform region bit copying. This supports the virtual address translation model (see “32-bit Virtual Addressing”...
position of the field are specified by two immediates. This is essentially a shift-right-and-mask operation. A simple right shift by a fixed amount can be specified by using shr with an immediate value for the shift amount. This is just an assembly pseudo-op for an extract instruction where the field to be extracted extends all the way to the left-most register bit.
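The shift-right-and-mask view of an extract, and the way an immediate right shift reduces to it, can be sketched in Python. The function names are hypothetical and only the unsigned (zero-extending) case is modeled.

```python
MASK64 = (1 << 64) - 1

def extract_unsigned(src: int, pos: int, length: int) -> int:
    """Shift the field down to bit 0, then mask it to `length` bits."""
    return ((src & MASK64) >> pos) & ((1 << length) - 1)

def shift_right_unsigned(src: int, count: int) -> int:
    """Immediate right shift expressed as an extract whose field runs from
    bit `count` up to the left-most register bit (bit 63)."""
    return extract_unsigned(src, count, 64 - count)
```

With these definitions, `extract_unsigned(0xABCD, 4, 8)` isolates the byte 0xBC, and `shift_right_unsigned(x, n)` agrees with `x >> n` for any 64-bit `x`.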
Compare Instructions and Predication
A set of compare instructions provides the ability to test for various conditions and affect the dynamic execution of instructions. A compare instruction tests for a single specified condition and generates a boolean result. These results are written to predicate registers.
The 64-bit (cmp) and 32-bit (cmp4) compare instructions compare two registers, or a register and an immediate, for one of ten relations (e.g., >, <=). The compare instructions set two predicate targets according to the result. The cmp4 instruction compares the least-significant 32-bits of both sources (the most significant 32-bits are ignored).
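The difference between the 64-bit and 32-bit compares can be shown with a small Python sketch for the equality relation. The function names are hypothetical; returning the pair (result, complement) as the two predicate targets reflects the normal compare type and is stated here as an assumption.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def cmp_eq(a: int, b: int):
    """64-bit equality compare; returns the two predicate targets."""
    result = (a & MASK64) == (b & MASK64)
    return result, not result

def cmp4_eq(a: int, b: int):
    """32-bit equality compare; the most significant 32 bits are ignored."""
    result = (a & MASK32) == (b & MASK32)
    return result, not result

a = 0x1_0000_0005  # differs from b only in the upper 32 bits
b = 0x0_0000_0005
wide = cmp_eq(a, b)
narrow = cmp4_eq(a, b)
```

Here the 64-bit compare reports the values as unequal while the 32-bit compare reports them as equal.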
The Unconditional compare type behaves the same as the Normal type, except that if the qualifying predicate is 0, both predicate targets are written with 0. This can be thought of as an initialization of the predicate targets, combined with a Normal compare.
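A minimal Python sketch of the Normal and Unconditional compare types follows. The function name and argument layout are hypothetical. It assumes that a Normal compare with a qualifying predicate of 0 leaves both targets unchanged, and that with a qualifying predicate of 1 the first target receives the result and the second its complement.

```python
def compare_update(result: bool, qp: bool, p1: bool, p2: bool,
                   unconditional: bool = False):
    """Return the new values of the two predicate targets (p1, p2)."""
    if qp:
        return result, not result  # both types behave the same when qp is 1
    if unconditional:
        return False, False        # qp is 0: both targets are written with 0
    return p1, p2                  # Normal type, qp is 0: targets unchanged
```

The Unconditional type therefore acts as an initialization of both targets followed by a Normal compare, as the text describes.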
4.3.4 Predicate Register Transfers Instructions are provided to transfer between the predicate register file and a general register. These instructions operate in a “broadside” manner whereby multiple predicate registers are transferred in parallel, such that predicate register N is transferred to/from bit N of a general register.
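The broadside mapping between the predicate register file and a general register can be sketched in Python. The function names are hypothetical; the predicate file is modeled as a list of 64 booleans.

```python
def predicates_to_gr(predicates):
    """Pack 64 predicate values so that predicate N lands in bit N."""
    value = 0
    for n, bit in enumerate(predicates):
        if bit:
            value |= 1 << n
    return value

def gr_to_predicates(value: int):
    """Unpack a 64-bit value so that bit N becomes predicate N."""
    return [bool((value >> n) & 1) for n in range(64)]

pr = [False] * 64
pr[0] = pr[3] = pr[63] = True
packed = predicates_to_gr(pr)
```

Packing and then unpacking returns the original predicate values, since the transfer is a one-to-one bit mapping.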
Load, store and semaphore instructions are summarized in Table 4-12 and the state related to memory reference instructions is summarized in Table 4-13. Table 4-12. Memory Access Instructions Mnemonic Floating-point Operation General Normal Load Pair Load ldfp Speculative load ld.s ldf.s ldfp.s Advanced load...
The floating-point load pair instructions load two adjacent single precision (4 bytes each), double precision (8 bytes each), or integer/parallel FP (8 bytes each) numbers into two independent floating-point registers (see the ldfp instruction description for restrictions on target register specifiers). Floating-point load pair instructions can specify base register update, but only by an immediate value equal to double the data size.
Three types of atomic semaphore operations are defined: exchange (xchg); compare and exchange (cmpxchg); and fetch and add (fetchadd). The xchg target is loaded with the zero-extended contents of the memory location addressed by the first source and then the second source is stored into the same memory location.
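The data movement performed by xchg can be sketched in Python. This is only a model of the described load-then-store behavior: the function name is hypothetical, the code is not atomic, and little-endian byte order is assumed for the example.

```python
def xchg(memory: bytearray, addr: int, new_value: int, size: int) -> int:
    """Return the zero-extended old contents and store `new_value` in place."""
    old = int.from_bytes(memory[addr:addr + size], "little")  # zero-extended
    truncated = new_value & ((1 << (8 * size)) - 1)
    memory[addr:addr + size] = truncated.to_bytes(size, "little")
    return old

mem = bytearray([0xFF, 0x01, 0x00, 0x00])
previous = xchg(mem, 0, 0x1234, 2)
```

After the call, `previous` holds the two bytes that were in memory and the location holds the new value.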
indicates that the register contains a deferred exception token, and that its 64-bit data portion contains an implementation-specific value that software cannot rely upon. In floating-point registers, a deferred exception is indicated by a specific pseudo-zero encoding called the NaTVal (see “Representation of Values in Floating-point Registers”...
For these instructions, if any source contains a deferred exception token, all predicate targets are either cleared or left unchanged, depending on the compare type (see Table 4-10 on page 1:56). Software can use this behavior to ensure that any dependent conditional branches are not taken and any dependent predicated instructions are nullified.
• The st8.spill may write a zero to the specified memory location, or • The st8.spill may write the register’s 64-bit data portion to memory, only if that implementation returns a zero into the target register of all NaTed speculative loads, and that implementation also guarantees that all NaT propagating instructions perform all computations as specified by the instruction pages.
4.4.5.1 Data Speculation Concepts
An ambiguous memory dependency is said to exist between a store (or any operation that may update memory state) and a load when it cannot be statically determined whether the load and store might access overlapping regions of memory. For convenience, a store that cannot be statically disambiguated relative to a particular load is said to be ambiguous relative to that load.
speculation check (chk.s) in that, if the speculation was successful, execution continues inline and no recovery is necessary; if speculation was unsuccessful, the chk.a branches to compiler-generated recovery code. The recovery code contains instructions that will re-execute all the work that was dependent on the failed data speculative load up to the point of the check instruction.
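The interplay between an advanced load, an intervening store, and the later check can be illustrated with a toy Python model. Everything here is a simplifying assumption for illustration: the class name `ToyALAT`, the dictionary keyed by register number, and the rule that a store removes any entry whose address range it overlaps. A real implementation tracks physical address tags and may behave more conservatively.

```python
class ToyALAT:
    """Toy model: one entry per target register, holding (address, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg: int, addr: int, size: int) -> None:
        self.entries[reg] = (addr, size)  # allocate or replace the entry

    def store(self, addr: int, size: int) -> None:
        # Drop every entry whose byte range overlaps the stored range.
        self.entries = {
            reg: (a, s) for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def check(self, reg: int) -> bool:
        """True: speculation succeeded; False: recovery code must run."""
        return reg in self.entries

alat = ToyALAT()
alat.advanced_load(reg=4, addr=0x1000, size=8)
alat.store(addr=0x2000, size=8)   # unrelated store
ok_after_unrelated = alat.check(4)
alat.store(addr=0x1004, size=4)   # overlaps the advanced load
ok_after_overlap = alat.check(4)
```

In this model the check succeeds after the unrelated store and fails after the overlapping one, which is the case where the chk.a would branch to recovery code.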
3. A new entry is allocated in the ALAT which contains the new ALAT register tag, the load access size, and a tag derived from the physical memory address. The insertion of the new ALAT entry must occur no later in visibility order than the load of the data.
than the load of the data. If the check load was an ordered check load (ld.c.clr.acq), then it is performed with the semantics of an ordered load (ld.acq). ALAT register tag lookups by advanced load checks and check loads are subject to memory ordering constraints as outlined in “Memory Access Ordering”...
3. Software accesses the RSE backing store with advanced loads. See Section 6.9, “RSE and ALAT Interaction” on page 2:146 (since RSE stores do not invalidate ALAT entries). 4. Software explicitly changes the virtual to physical register mapping on stacked registers by switching the RSE backing stores.
moved out of the loop by the compiler. This behavior ensures that if the check load fails on one iteration, then the check load will not necessarily fail on all subsequent iterations. Whenever a new entry is inserted into the ALAT or when the contents of an entry are updated, the information written into the ALAT only uses information from the check load and does not use any residual information from a prior entry.
Each locality hint implies a particular allocation path in the memory hierarchy. The allocation paths corresponding to the locality hints are depicted in Figure 4-2. The allocation path specifies the structures in which the line containing the data being referenced would best be allocated. If the line is already at the same or higher level in the hierarchy no movement occurs.
The following instructions are defined for flush control: flush cache (fc, fc.i) and flush write buffers (fwb). The fc instruction invalidates the cache line in all levels of the memory hierarchy above memory. If the cache line is not consistent with memory, then it is copied into memory before invalidation.
Refer to the description of sync.i on page 3:259 of Volume 3: Intel® Itanium® Instruction Set Reference for an example of self-modifying code.
4.4.7 Memory Access Ordering
Memory data access ordering must satisfy read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR) data dependencies to the same memory location.
Table 4-21 summarizes memory ordering instructions related to cacheable memory. For definitions of the ordering rules related to non-cacheable memory, cache synchronization, and privileged instructions, refer to Section 4.4.7, “Sequentiality Attribute and Ordering” on page 2:82. Table 4-21. Memory Ordering Instructions Mnemonic Operation Ordered load and ordered check load...
iteration is started, and another is finished each time around. During the epilog phase, no new iterations are started, but previous iterations are completed (draining the software pipeline). A predicate is assigned to each stage to control the activation of the instructions in that stage (this predicate is called the “stage predicate”).
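How stage predicates switch on and off across the prolog, kernel, and epilog phases can be sketched in Python. This is a simplified model under stated assumptions: one new source iteration starts per pass through the loop body, each stage takes one pass, and the function name `stage_predicates` is hypothetical. It does not model rotating registers or the loop branch mechanics.

```python
def stage_predicates(iterations: int, stages: int):
    """For each pass through the loop body, list which stages are active."""
    schedule = []
    for cycle in range(iterations + stages - 1):
        # Stage s works on the iteration that started s passes earlier.
        schedule.append([0 <= cycle - s < iterations for s in range(stages)])
    return schedule

schedule = stage_predicates(iterations=3, stages=2)
```

For three iterations and two stages the schedule has four passes: the first enables only stage 0 (prolog), the middle two enable both stages (kernel), and the last enables only stage 1 (epilog, draining the pipeline).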
There are two categories of software-pipelined loop branch types: counted and while. Both categories have two forms: top and exit. The “top” variant is used when the loop decision is located at the bottom of the loop body. A taken branch will continue the loop while a not-taken branch will exit the loop.
only during the epilog phase and is initialized to one more than the number of epilog stages. If the qualifying predicate is zero during the speculative stages of the prolog, EC will be decremented during this part of the prolog, and the initialization value for EC is increased accordingly.
Table 4-28. Predictor Deallocation Hint
Completer | Operation
none | Don’t deallocate
clr | Deallocate branch information
4.5.3 Branch Predict Instructions
Branch predict instructions are entire instructions whose only purpose is to provide early information about future branches. Branch predict instructions provide the following pieces of information: •...
saturation form treats both sources as signed and clamps the result to the limits of a signed range. The unsigned saturation form treats one source as unsigned and clamps the result to the limits of an unsigned range. Two variants are defined that treat the second source as either signed (.uus) or unsigned (.uuu).
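The saturation forms can be illustrated for a single element of a parallel add in Python. The helper names are hypothetical, the element width defaults to one byte, and the label "sss" for the signed form is an assumption made by analogy with the .uus and .uuu completers named in the text.

```python
def to_signed(x: int, bits: int) -> int:
    """Reinterpret the low `bits` bits of x as a two's complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))

def saturating_add(a: int, b: int, form: str, bits: int = 8) -> int:
    """Add one pair of elements; `form` is "sss", "uus", or "uuu"."""
    mask = (1 << bits) - 1
    if form == "sss":    # both sources signed, signed result range
        total = to_signed(a, bits) + to_signed(b, bits)
        result = clamp(total, -(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    elif form == "uus":  # first source unsigned, second signed
        result = clamp((a & mask) + to_signed(b, bits), 0, mask)
    elif form == "uuu":  # both sources unsigned
        result = clamp((a & mask) + (b & mask), 0, mask)
    else:
        raise ValueError(f"unknown saturation form: {form}")
    return result & mask
```

For instance, adding 0x01 to 0x7F in the signed form stays at 0x7F instead of wrapping to a negative value, and adding 0x01 to 0xFF in the fully unsigned form stays at 0xFF.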
Table 4-29. Parallel Arithmetic Instructions (Continued) Mnemonic Operation 1-byte 2-byte 4-byte Parallel shift left and add with signed saturation pshladd Parallel shift right and add with signed saturation pshradd Parallel compare pcmp Parallel signed multiply of odd elements pmpy.l Parallel signed multiply of even elements pmpy.r Parallel signed multiply and shift right pmpyshr...
Table 4-31. Parallel Data Arrangement Instructions Mnemonic Operation 1-byte 2-byte 4-byte Interleave odd elements from both sources mix.l Interleave even elements from both sources mix.r Arbitrary copy of individual source elements Convert from larger to smaller elements with signed saturation pack.sss Convert from larger to smaller elements with unsigned pack.uss...
Instructions are provided to transfer between the branch registers and the general registers. The move to branch register instruction can also optionally include branch hints. See “Branch Prediction Hints” on page 1:78. Instructions are defined to transfer between the predicate register file and a general register.
Table 4-33. String Support Instructions
Mnemonic | Operation | 1-byte | 2-byte
czx.l | Locate first zero element, left to right
czx.r | Locate first zero element, right to left
4.8.2 Bit Strings
The population count instruction (popcnt) writes the number of bits that have a value of 1 in the source register into the target register.
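The population count operation can be modeled in a few lines of Python. The function name mirrors the mnemonic for readability; masking to 64 bits is an assumption that reflects the register width.

```python
MASK64 = (1 << 64) - 1

def popcnt(src: int) -> int:
    """Count the bits that are 1 in the 64-bit source value."""
    return bin(src & MASK64).count("1")
```

For example, `popcnt(0b1011)` is 3, and a register with all bits set gives 64.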
Floating-point Programming Model
The floating-point architecture is fully compliant with the ANSI/IEEE Standard for Binary Floating-Point Arithmetic (Std. 754-1985). There is full IEEE support for single, double, and double-extended real formats. The two IEEE methods for controlling rounding precision are supported. The first method converts results to the double-extended exponent range.
Real numbers reside in 82-bit floating-point registers in a three-field binary format (see Figure 5-1). The three fields are:
• The 64-bit significand field, b63.b62b61 .. b1b0, contains the number's significant digits. This field is composed of an explicit integer bit (significand{63}), and 63 bits of fraction (significand{62:0}).
Table 5-2. Floating-point Register Encodings (Continued)
Class or Subclass | Sign (1 bit) | Biased Exponent (17-bits) | Significand i.bb...bb (64-bits; Explicit Integer Bit is Shown)
Pseudo-NaNs | | 0x1FFFF | 0.000...01 through 0.111...11
Pseudo-Infinity | | 0x1FFFF | 0.000...00
Normalized Numbers (Floating-point Register Format Normals) | | 0x00001 through 0x1FFFE | 1.000...00 through 1.111...11
Integers or Parallel FP...
Table 5-2. Floating-point Register Encodings (Continued)
Class or Subclass | Sign (1 bit) | Biased Exponent (17-bits) | Significand i.bb...bb (64-bits; Explicit Integer Bit is Shown)
IA-32 Stack Double Real Denormals (produced when computation model is IA-32 Stack Double) | | 0x00000 | 0.000...01...(11)0s through 0.111...11...(11)0s
Double-Extended Real Pseudo-Denormals | | 0x00000 | 1.000...00 through 1.111...11...
Table 5-3. Floating-point Status Register Field Description
Field Bits Description
traps.vd – Invalid Operation Floating-Point Exception fault (IEEE Trap) disabled when this bit is set
traps.dd – Denormal/Unnormal Operand Floating-Point Exception fault disabled when this bit is set
traps.zd – Zero Divide Floating-Point Exception fault (IEEE Trap) disabled when this bit is set
traps.od – Overflow Floating-Point Exception trap (IEEE Trap) disabled when this bit is set
traps.ud...
fields flags are merely indications of the occurrence of floating-point exceptions. Flush-to-Zero (FTZ) mode causes results which encounter “tininess” (see “Definition of Tininess, Inexact and Underflow” on page 1:106) to be truncated to the correctly signed zero. Flush-to-Zero mode can be enabled only if Underflow is disabled. If Underflow is enabled then it takes priority and Flush-to-Zero mode is ignored.
If FPSR.sfx.td is set, the FPSR.traps bits are treated as if they are all set (disabled). Note that FPSR.sf0.td is a reserved field which returns 0 when read.
Floating-point Instructions
This section describes the floating-point instructions. Refer to Volume 3: Intel® Itanium® Instruction Set Reference for a detailed description.
5.3.1...
The fneg pseudo-operation (see Table 5-15) simply reverses the sign bit of the operand and is therefore not equivalent to the IEEE negation operation. For the IEEE negation operation, an fnma using FR 1 as the multiplicand and FR 0 as the addend must be used.
with the FPSR.sf0.flags and FPSR.traps. If the flags of the alternate status field indicate the occurrence of an event that corresponds to an enabled floating-point exception in FPSR.traps, or an event that is not already registered in the FPSR.sf0.flags (i.e., the flag for that event in FPSR.sf0.flags is clear), then the fchkf instruction branches to recovery code.
Exceptions are processed according to a predetermined precedence. Precedence in exception handling means that higher-priority exceptions are flagged first and results are delivered according to the requirements of that exception. Lower-priority exceptions are not flagged even if they occur. For example, dividing an SNaN by zero causes an invalid operation exception (due to the SNaN) and not a zero-divide exception;...
Figure 5-11. Floating-point Exception Fault Prioritization [flowchart: from START, decision points test in turn for a NaTVal operand, an unsupported operand, an SNaN operand, a QNaN operand, and other invalid conditions; each invalid case asks whether the Invalid FP fault is enabled]...
5.4.1.3 Floating-point Exception Trap
A Floating-point Exception trap occurs if one of the following four circumstances arises:
1. The processor requests system software assistance to complete the operation, via the Software Assist trap
2. The IEEE Overflow trap is enabled and an overflow occurs
3.
then inexactness is signaled. If the significand was rounded by adding a one to its least significant bit, then bit fpa in ISR.code is set to 1. Finally, an interruption due to a Floating-Point Exception trap will occur. Note that when rounding to single, double, or double-extended real, the overflow trap enabled response for normal (non Parallel FP) arithmetic instructions is not guaranteed to be in the range of a valid single, double, or double-extended real quantity, because it is in 17-bit exponent format.
performance on implementations that do not implement denormal handling in hardware. When the Flush-to-Zero mode is enabled, floating-point exception software assist traps will not occur when producing tiny results.
5.4.4 Integer Invalid Operations
Floating-point to integer conversions which are invalid (in the IEEE sense) signal an Invalid Operation Floating-Point Exception fault.
• The NaTVal is a natural extension of the IEEE concept of NaNs. It is used to support speculative execution.
• Flush-to-Zero mode is an industry standard addition.
• The minimum and maximum instructions allow the efficient execution of the common Fortran Intrinsic Functions: MIN(), MAX(), AMIN(), AMAX();...
This section does not cover the details of the IA-32 application programming model, IA-32 instructions, and registers. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for details regarding the IA-32 application programming model. Volume 1, Part 1: IA-32 Application Execution Model in an Intel® Itanium® System Environment 1:109...
• Itanium instructions can access the entire Itanium and IA-32 application register state. This includes IA-32 segment descriptors, selectors, general registers, physical floating-point registers, MMX technology registers, and SSE registers. See...
Page 122
Itanium instruction set. There are two forms: register indirect and absolute. The absolute form computes the Itanium target virtual address as follows:...
Page 123
Itanium instruction set into IA-32 VM86, Real Mode or Protected Mode. While jmpe and interruptions will transition the processor from either IA-32 VM86, Real Mode or...
To promote straightforward parameter passing, integer and IEEE floating-point register and memory data types are binary compatible between the IA-32 and Itanium instruction sets.
• Undefined: Registers marked as undefined may be used as scratch areas for execution of IA-32 instructions by the processor and are not guaranteed to be preserved across instruction set transitions.
Instruction Pointer
Floating-point Registers
FR0      constant +0.0
FR1      constant +1.0
FR2-5    unmodified    Intel® Itanium® preserved registers
FR6-7    undefined     IA-32 code execution space
Page 127
IA-32 time stamp counter (TSC) and Intel® Itanium® Interval Timer
unmodified: RUC continues to count while in IA-32 execution mode
IP is a 64-bit virtual pointer shared with the Itanium instruction set. The following relationship is defined between EIP and IP while executing IA-32 instructions:
IP{63:32} = 0;
IP{31:0} = EIP{31:0} + CSD.Base;
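The EIP-to-IP relationship can be sketched in a few lines of Python. This is an illustrative model only: the function name `ia32_ip` is hypothetical, and the wrap-around at 32 bits is an assumption drawn from the fact that only IP{31:0} receives the sum while IP{63:32} is forced to zero.

```python
MASK32 = 0xFFFFFFFF  # selects bits 31:0


def ia32_ip(eip: int, csd_base: int) -> int:
    """Return the 64-bit IP implied by an IA-32 EIP and the code segment base.

    Assumption: the sum is truncated to 32 bits, so the upper half of IP
    (bits 63:32) is always zero.
    """
    low = ((eip & MASK32) + csd_base) & MASK32  # IP{31:0} = EIP{31:0} + CSD.Base
    return low                                  # IP{63:32} = 0


if __name__ == "__main__":
    print(hex(ia32_ip(0x1000, 0x0040_0000)))
```

Under this model an EIP near the top of the 4 GB space wraps to a low address instead of producing a pointer above 4 GB.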
type 55:52 Type identifier for data/code segments, including the Access bit (bit 52). See the Intel® 64 and IA-32 Architectures Software Developer’s Manual for encodings and definition.
Non System Segment. If 1, a data segment; if 0, a system segment.
Page 130
32-bits, otherwise 16-bits.
Segment Limit Granularity. If 1, scales the segment limit by lim=(lim<<12) | 0xFFF for IA-32 instruction set memory references. This field is ignored for Intel® Itanium® instruction set memory references.
6.2.2.3.1 Data and Code Segments...
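The granularity scaling can be checked with a small Python sketch. The helper name `effective_limit` is hypothetical, and treating the raw limit as a 20-bit field is an assumption made for the example rather than something stated in this passage.

```python
def effective_limit(lim: int, g: int) -> int:
    """Return the byte-granular segment limit used for IA-32 references.

    lim: raw limit field (assumed here to be 20 bits wide)
    g:   granularity bit; 1 means the limit is scaled to 4 KB pages
    """
    if g:
        return (lim << 12) | 0xFFF  # lim=(lim<<12) | 0xFFF
    return lim


if __name__ == "__main__":
    print(hex(effective_limit(0xFFFFF, 1)))
```

With g=1 the low twelve bits are filled with ones, so the limit always points at the last byte of a 4 KB page.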
Segment limit should be set to 0xFFFF for normal RM 64KB operation.
f. For valid segments the p-bit should be set to 1; for null segments the p-bit should be set to 0.
Page 132
• Itanium architecture-based software should ensure PSR.cpl is 0.
• Itanium architecture-based software should ensure the stack segment descriptor register’s DPL is 0.
Stack Fault references to SS read and not readable, write and not writeable s, p, a-bits are not 1 g-bit/limit segment limit violation
These flags are ignored by Itanium instructions. Flags ID, OF, DF, SF, ZF, AF, PF and CF are defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
IA-32 floating-point register stack, numeric controls and environment are mapped into the Itanium floating-point registers FR8 - FR15 and the application register name space as shown in Table 6-6.
IA-32 Floating-point Stack
IA-32 floating-point registers are defined as follows:
• IA-32 numeric register stack is mapped to FR8 - FR15, using the Intel 8087 80-bit IEEE floating-point format.
• For IA-32 instruction set references, floating-point registers are logically mapped into FR8 - FR15 based on the IA-32 top-of-stack (TOS) pointer held in FCR.top.
Page 137
NaN, Infinity or Denormal of each IA-32 logical floating-point register are not supported. However, IA-32 instruction set reads of FTW compute the additional special...
Intel Itanium Usage in the Intel IA-32 State Bits IA-32 Usage ® State Itanium Architecture FSW, FTW, MXCSR state in the FSR Register
Page 139
6.2.2.5.4 IA-32 Floating-point Environment
To support the Intel 8087 delayed numeric exception model, FSR, FDR and FIR contain pending information related to the numeric exception. FDR contains the operand’s effective address and segment selector. FIR contains the numeric instruction’s effective address, code segment selector, and opcode bits.
IA-32 Intel MMX Technology Registers
The eight IA-32 Intel MMX technology registers are mapped onto the eight Itanium floating-point registers FR8 - FR15, where MM0 is mapped to FR8 and MM7 is mapped to FR15. The MMX technology register mapping for the IA-32 floating-point stack view depends on the IA-32 floating-point Top-of-Stack value.
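The two views of FR8 - FR15 can be compared with a short Python sketch. The MMX mapping follows the text directly (MM0 to FR8 through MM7 to FR15). The stack mapping is an assumption based on the conventional x87 rule that ST(i) lives in physical register (TOS + i) mod 8; the function names are hypothetical.

```python
FR_BASE = 8  # MM0 and physical stack register 0 both sit in FR8


def mmx_to_fr(n: int) -> int:
    """Itanium FR number holding MMn (fixed mapping, independent of TOS)."""
    if not 0 <= n <= 7:
        raise ValueError("MMX register index must be 0..7")
    return FR_BASE + n


def st_to_fr(i: int, tos: int) -> int:
    """Itanium FR number holding ST(i) for a given top-of-stack value.

    Assumption: conventional x87 addressing, physical = (TOS + i) mod 8.
    """
    if not (0 <= i <= 7 and 0 <= tos <= 7):
        raise ValueError("stack index and TOS must be 0..7")
    return FR_BASE + (tos + i) % 8


if __name__ == "__main__":
    for tos in (0, 3, 7):
        print(tos, [st_to_fr(i, tos) for i in range(8)])
```

The sketch shows why the stack view of a given MMX register changes with TOS while the MMn name always selects the same FR.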
To avoid performance degradation, programmers are strongly advised not to intermix IA-32 floating-point and IA-32 MMX technology instructions. See the Intel® 64 and IA-32 Architectures Software Developer’s Manual for MMX technology coding guidelines.
6.2.2.7 IA-32 SSE Registers
The eight 128-bit IA-32 SSE registers (XMM0-7) are mapped onto eight pairs of physical Itanium floating-point registers (sixteen registers, FR16 - FR31).
Starting 32-bit virtual addresses are truncated to 32 bits after the addition of the segment base. Ending virtual address...
Page 143
• All IA-32 stores have release semantics.
• All IA-32 loads have acquire semantics.
• All IA-32 read-modify-write or lock instructions have release and acquire semantics (fully fenced).
Page 144
IA-32 code, existing entries in the ALAT are ignored. For details on the ALAT, refer to Section 4.4.5.2, “Data Speculation and Instructions” on page 1:64.
Page 145
Software should not rely on the behavior of NaT or NaTVal during IA-32 instruction execution, or propagate NaT or NaTVal into IA-32 instructions.
Itanium instruction set. It is intended for those interested in furthering their understanding of application architecture features and optimization techniques that benefit application performance. Intel and the industry are developing compilers to take advantage of these techniques. Application developers are not advised to use this as a guide to assembly language programming for the Itanium architecture.
Page 149
1:138 Volume 1, Part 2: About the Optimization Guide...
(f0-f127) that are used for floating-point computations. The first two registers, f0 and f1, are read-only and read as +0.0 and +1.0, respectively. Instructions that write to f0 or f1 will fault. Volume 1, Part 2: Introduction to Programming for the Intel® Itanium® Architecture 1:139...
(RAW) or write after write (WAW) register dependencies. Instruction groups are delimited by stops in the assembly source code. Since instruction groups have no RAW...
When the value is false (0), the processor discards any results and raises no exceptions. Consider the following C code: if (a) { b = c + d; if (e) { h = i + j;...
Branches and Hints
Since branches have a major impact on program performance, the Itanium architecture includes features to improve their performance by:...
Thus, after one rotation, the content of register X will be found in register X+1 and the value of the highest numbered rotating register...
• Reduced overhead for procedure calls through the register stack mechanism.
• Streamlined loop handling through hardware support of software pipelined loops.
• Support for hiding memory latency using speculation.
Memory Reference Overview Memory latency is a major factor in determining the performance of integer applications. In order to help reduce the effects of memory latency, the Itanium architecture explicitly supports software pipelining, large register files, and compiler-controlled speculation. This chapter discusses features and optimizations related to compiler-controlled speculation.
3.2.3 Data Prefetch Hint The lfetch instruction requests that lines be moved between different levels of the memory hierarchy. Like all hint instructions defined in the Itanium architecture, lfetch has no effect on program correctness, and any microarchitecture implementation may choose to ignore it.
A compiler cannot safely move the load instruction before the branch unless it can guarantee that the moved load will not cause a fatal program fault or otherwise corrupt program state. Since the load cannot be moved upward, the schedule cannot be improved using normal code motion.
3.3.2.2 Data Dependency in the Intel® Itanium® Architecture
The Itanium architecture requires the programmer to insert stops between RAW and WAW register dependencies to ensure correct code results. For example, in the code below, the add instruction computes a value in r4 needed by the sub instruction:
add r4=r5,r6 ;;...
Page 162
*ptr1 = 6;
x = *ptr2;
Using Speculation in the Intel® Itanium® Architecture to Overcome Dependencies
Both data and control dependencies constrain optimization of program code. The Itanium architecture provides support for two basic techniques used to overcome dependencies:
•...
3.4.2 Using Data Speculation in the Intel® Itanium® Architecture
Data speculation in the Itanium architecture uses a special load instruction (ld.a) called an advanced load instruction and an associated check instruction (chk.a or ld.c) to validate data-speculated results.
Page 164
If no matching entry is found, the speculative results need to be recomputed: • Use a chk.a if a load and some of its uses are speculated. The chk.a jumps to compiler-generated recovery code to re-execute the load and dependent instructions.
Page 165
The compiler could move up not only the load, but also one or more of its uses. This transformation uses a chk.a rather than a ld.c instruction to validate the advanced load. Using the same example code sequence but now advancing the add as well as the ld8 results in: ld8.a r6=[r8];;...
Page 166
3.4.3 Using Control Speculation in the Intel® Itanium® Architecture
The check to determine if control speculation was successful is similar to that for data speculation.
3.4.3.1 The NaT Bit
The Not A Thing (NaT) bit is an extra bit on each of the general registers. A register NaT bit indicates whether the content of a register is valid.
Although every speculative computation needs to be checked, this does not mean that every speculative load requires its own chk.s. Speculative checks can be optimized by taking advantage of the propagation of NaT bits through registers as described in Section 3.5.6.
Page 168
Optimization of Memory References Speculation can increase parallelism and help to hide latency by enabling more code motion than can be performed on traditional architectures. Speculation can increase the application of traditional loop optimizations such as invariant code motion and common subexpression elimination.
Page 169
3.5.2 Data Interference Data references with low interference probabilities and high path probabilities can make the best use of data speculation. In the pseudo-code below, assume the probabilities that the stores to *p1 and *p2 conflict with var are independent. *p1 = /* Prob interference = 0.30 */ .
memory conflicts, or aliasing in the ALAT, the decision as to where to place recovery code for advanced loads is more difficult than for control speculation and should be based on the expected conflict rate for each load. As a general rule, efficient compilers will attempt to minimize code growth related to speculation.
Page 171
A disadvantage of post-increment loads is that they create new dependencies between post-increment loads and the operations that use the post-increment values. In some cases, the compiler may wish to separate post-increment loads into their component instructions to improve the overall schedule. Alternatively, the compiler could wait until after instruction scheduling and then opportunistically find places where post-increment loads could be substituted for separate load and add instructions.
3.5.6 Minimizing Check Code Checks of speculative loads can sometimes be combined to reduce code size. The propagation of NaT bits and NaTVals via speculative instructions can permit a single check of a speculative result to replace multiple intermediate checks. The code below demonstrates this optimization potential: ld4.s r1=[r10]...
Page 173
Summary The examples in this chapter show where the Itanium architecture can take advantage of existing techniques like dynamic profiling and disambiguation. Special architectural support allows implementation of speculation in common scenarios in which it would normally not be allowed. Speculation, in turn, increases ILP by making greater code motion possible, thus enhancing traditional optimizations such as those involving loops.
Page 174
Predication, Control Flow, and Instruction Stream
Overview
This chapter is divided into three sections that describe optimizations related to predication, control flow, and branch hints as follows:
• The predication section describes if-conversion, predicate usage, and code scheduling to reduce the effects of branching.
•...
Page 175
4.2.2 Predication in the Intel® Itanium® Architecture
Now that the performance implications of branching have been described, this section gives an overview of predication in the Itanium architecture, the primary mechanism used by the optimizations described in this section.
Page 176
Almost all Itanium instructions can be tagged with a guarding predicate. If the value of the guarding predicate is false at execution time, then the predicated instruction’s architectural updates are suppressed, and the instruction behaves like a nop. If the predicate is true, then the instruction behaves as if it were unpredicated.
Page 177
The process of predicating instructions in conditional blocks and removing branches is referred to as if-conversion. Once if-conversion has been performed, instructions can be scheduled more freely because there are fewer branches to limit code motion, and there are fewer branches competing for issue slots. In addition to removing branches, this transformation will make dynamic instruction fetching more efficient since there are fewer possibilities for control flow changes.
Figure 4-1. Flow Graph Illustrating Opportunities for Off-path Predication (figure not reproduced; it shows Block A and Block B)
If some of the instructions in block A or block B can be included in the main trace without increasing its critical path, then techniques of upward code motion can be applied to reduce the critical path through blocks A and B when they are taken.
Page 179
4.2.3.4 Downward Code Motion As with upward code motion, downward code motion is normally difficult in the presence of stores. The next example shows how code can be moved downward past a label, a transformation that is often unsafe without predication: r56 = [r45];;...
Page 180
4.2.4.1 Unbalanced Execution Paths
The simple conditional below has an unbalanced flow-dependency height. Suppose that non-predicated assembly for this sequence takes two clocks for the if-block and approximately 18 clocks for the else-block if we assume a setf takes 8 clocks, a getf takes 2 clocks, and an xma takes 6 clocks:
if (r4) // 2 clocks...
Page 181
4.2.4.4 Case 3 Suppose the if-clause is executed 30% of the time and the branch mispredicts 30% of the time. The average number of clocks for: • Unpredicated code is: (2 cycles * 30%) + (18 cycles * 70%) + (10 cycles * 30%) = 16.2 clocks •...
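The unpredicated average above is a weighted sum, which the following Python sketch reproduces. The function name and parameter names are hypothetical; the 10-cycle misprediction penalty is taken from the formula in the text.

```python
def expected_clocks(p_if: float, if_clocks: float, else_clocks: float,
                    mispredict_rate: float, mispredict_penalty: float) -> float:
    """Average clocks for a branchy if/else: path costs weighted by how
    often each path runs, plus the misprediction penalty weighted by the
    misprediction rate."""
    path_cost = p_if * if_clocks + (1.0 - p_if) * else_clocks
    return path_cost + mispredict_rate * mispredict_penalty


if __name__ == "__main__":
    # Case 3: if-clause 30% of the time, 30% mispredicts, 10-cycle penalty
    print(round(expected_clocks(0.30, 2, 18, 0.30, 10), 1))
```

Changing `p_if` and `mispredict_rate` makes it easy to see how the same code shape moves between favoring and disfavoring if-conversion.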
Page 182
4.2.5 Guidelines for Removing Branches
The following if-conversion guidelines apply to cases where only local behavior of the code and its execution profile are known:
1. The flow dependency and resource availability heights of both paths must be considered when deciding whether to predicate or not.
2....
Page 183
4.3.1 Reducing Critical Path with Parallel Compares
The computation of the compound branch condition shown below requires several instructions on processors without special instructions:
if ( rA || rB || rC || rD ) {
    /* If-block instructions */
}
/* after if-block */
The pseudo-code below shows one possible solution that uses a sequence of branches:
cmp.ne p1,p0 = rA,0
cmp.ne p2,p0 = rB,0...
Page 184
Initialization code must be placed in an instruction group prior to the parallel compare. However, since the initialization code has no dependencies on prior values, it can generally be scheduled without contributing to the critical path of the code. The instructions below show how to generate code for the example above using parallel compares: cmp.ne p1,p0 = r0,r0;;...
Page 185
An example uses a basic block with four possible successors. The following Itanium architecture-based multi-target branch code uses a BBB bundle template and can branch to either block B, block C, block D, or fall through to block A: label_AA: ...
Page 186
The Itanium architecture allows multiple instructions to target the same register in the same clock provided that only one of the instructions writing the target register is predicated true in that clock. Similar capabilities exist for writing predicate registers, as discussed in Section 4.3.1.
Page 187
By using predication to reduce the number of control flow changes, the fetching efficiency will generally improve. The only case where predication is likely to reduce instruction cache efficiency is when there is a large increase in the number of instructions fetched which are subsequently predicated off.
Page 188
Two types of branch-related hints are defined by the Itanium architecture: branch prediction hints and instruction prefetch hints. Branch prediction hints let the compiler recommend the resources (if any) that should be used to dynamically predict specific branches. With prefetch hints, the compiler can indicate the areas of the code that should be prefetched to reduce demand I-cache misses.
Page 189
This scenario can be hinted to the processor by executing an advanced load (ld.a or ld.sa) to the address that this software thread is waiting on, and then by executing a hint @pause instruction (in a subsequent instruction group). This encourages the processor to devote more resources to other threads, yet if an entry is invalidated from this thread's ALAT, normal processor resource allocation is resumed for this thread.
Page 190
Resource allocation within the processor eventually reverts to a fair allocation, so there's no need for software to hint that it is no longer in a critical section. Processors that support this hint also ensure that it cannot be abused to affect overall longer-term fairness of processor resource allocation.
Page 191
1:180 Volume 1, Part 2: Predication, Control Flow, and Instruction Stream...
Software Pipelining and Loop Support Overview The Itanium architecture provides extensive support for software-pipelined loops, including register rotation, special loop branches, and application registers. When combined with predication and support for speculation, these features help to reduce code expansion, path length, and branch mispredictions for loops that can be software pipelined.
Page 193
This section describes two general methods for overlapping loop iterations, both of which result in code expansion on traditional architectures. The code expansion problem is addressed by loop support features in the Itanium architecture that are explored later in this chapter. The loop above will be used as a running example in the next few sections.
Page 194
utilization can be increased by unrolling the loop more times, but at the cost of further code expansion. The loop below is unrolled four times (assuming the trip count is a multiple of four):
r15 = 4,r5
r25 = 8,r5
r35 = 12,r5
r16 = 4,r6
r26 = 8,r6
r36 = 12,r6;;...
Page 195
Loop Support Features in the Intel® Itanium® Architecture
The code expansion that results from loop optimizations (such as software pipelining and loop unrolling) on traditional architectures can increase the number of instruction cache misses, thus reducing overall performance.
Page 196
Itanium architecture allow some loops to be software pipelined without code expansion. Register rotation provides a renaming mechanism that reduces the need for loop unrolling and software renaming of registers. Special software pipelined loop branches support register rotation and, combined with predication, reduce the need to generate separate blocks of code for the prolog and epilog phases.
Page 197
for the same source iteration. Each one written to p16 sequentially enables all the stages for a new source iteration. This behavior is used to enable or disable the execution of the stages of the pipelined loop during the prolog, kernel, and epilog phases as described in the next section.
and a decision is made to exit the loop. The special case in which a software-pipelined loop branch is executed with EC equal to 0 can occur in unrolled software-pipelined loops if the target of the cexit branch is set to the next sequential bundle. Figure 5-1.
Note: Rotating GRs have now been included in the code (the code directly preceding did not include them). Also, induction variables that are post incremented must be allocated to the static portion of the register file:
lc = 199 // LC = loop count - 1
ec = 4 // EC = epilog stages + 1
pr.rot = 1<<16;;...
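The effect of the LC and EC settings above can be traced with a small Python model of a counted-loop branch. This is a sketch under stated assumptions, not the architectural definition: the br.ctop rules encoded in the docstring, the function name `simulate_ctop`, and the choice to track only the first `stages` rotating predicates (p16 upward) are all assumptions made for illustration.

```python
def simulate_ctop(lc: int, ec: int, stages: int) -> list[tuple[int, ...]]:
    """Trace the stage predicates seen by each pass through the kernel.

    Assumed br.ctop model (not taken from the text above):
      LC != 0          -> LC -= 1, write 1 to the new p16, branch taken
      LC == 0, EC > 1  -> EC -= 1, write 0 to the new p16, branch taken
      LC == 0, EC == 1 -> EC -= 1, write 0 to the new p16, fall through
    """
    if ec < 1 or stages < 1:
        raise ValueError("this sketch expects ec >= 1 and stages >= 1")
    preds = [1] + [0] * (stages - 1)      # pr.rot = 1<<16 sets p16 only
    trace = []
    while True:
        trace.append(tuple(preds))        # predicates for this kernel pass
        if lc != 0:
            lc, new_p16, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_p16, taken = ec - 1, 0, True
        else:
            ec, new_p16, taken = 0, 0, False
        preds = [new_p16] + preds[:-1]    # rotation: p16 -> p17 -> ...
        if not taken:
            return trace


if __name__ == "__main__":
    trace = simulate_ctop(lc=199, ec=4, stages=4)
    print(len(trace), trace[0], trace[-1])
```

Under these assumptions, lc = 199 and ec = 4 give 203 passes through the kernel: each of the four stages is enabled for exactly 200 of them, with the first three passes acting as the prolog and the last three as the epilog.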
There are a few differences in the operation of the while loop branch compared to the counted loop branch. The while loop branch does not access LC — a branch predicate determines the behavior of this branch instead. During the kernel and epilog phases, the branch predicate is one and zero respectively.
Page 201
Value that is incremented (or decremented) once per source iteration by the same amount.
Optimization of Loops in the Intel® Itanium® Architecture
Register rotation, predication, and the software pipelined loop branches allow the generation of compact, yet highly parallel code. Speculation can further increase loop performance by removing dependency barriers that limit the throughput of software pipelined loops.
Notice that the load for the second source iteration is executed before the compare and branch of the first source iteration. That is, the load (and the update of r5) is speculative. The loop condition is not computed until cycle X+2, but in order to maximize the use of resources, it is desirable to start the second source iteration at cycle X+1.
Page 203
Table 5-2. wtop Loop Trace (table not reproduced; columns: Cycle, Port/Instructions, State before br.wtop)
The executions of br.wtop in the first two cycles of the prolog do not correspond to any of the source iterations.
Page 204
Below is a possible pipeline with an II of 2, assuming a floating-point load latency of 9 cycles:
stage 1:   (p16) ldfs f4 = [r5],4
           (p16) ldfs f9 = [r8],4;;
           // empty cycle
stage 2-4: --- // empty stages
stage 5:   // empty cycle
           (p20) fcmp.ge.unc p1,p2 = f4,f9;;...
Page 205
5.5.3.1 Converting Multiple Exit Loops to Single Exit Loops
The first approach is to transform the multiple exit loop into a single exit loop. In the source loop, execution of the add, the second compare and the second branch is guarded by the first branch.
Page 206
5.5.3.2 Pipelining with Explicit Multiple Exits The second approach is to combine the last three instructions in the loop into a br.cloop instruction and then pipeline the loop. The pipeline using this approach is shown below: stage 1: ld4.s r4 = [r5],4;; // II = 1 stage 4: ld4.s r9 = [r4];;...
Page 207
The following is a possible pipeline with an II of 2: stage 1: r4 = [r5],4 // Cycle 0 r7 = [r8],4;; // Cycle 0 // empty cycle stage 2: // empty cycle [r6] = r4,4 // Cycle 3 [r9] = r7,4;; // Cycle 3 In the source loop, one iteration is completed every three cycles.
Page 208
5.5.5.2 Conflicts in the ALAT Using an advanced load to remove a likely invariant load from a loop while advancing another load inside the loop results in poor performance if the latter load targets a rotating register. The advanced load that targets the rotating register will eventually invalidate the ALAT entry for the loop invariant load.
Page 209
5.5.6 Loop Unrolling Prior to Software Pipelining In some cases, higher performance can be achieved by unrolling the loop prior to software pipelining. Loops that are resource constrained can be improved by unrolling such that the limiting resource is more fully utilized. In the following example if we assume the target processor has only two memory units, the loop performance is bound by the number of memory units: r4 = [r5],4...
Page 210
predicate for the odd iteration is in predicate register X, the stage predicate for the even iteration is in predicate register X-1. The pseudo-code to implement this pipeline assuming an unknown trip count is shown below: r15 = r5,4 r18 = r8,4 lc = r2 // LC = loop count - 1 ec = 4...
Page 211
If the loop trip count is even, two epilog stages are executed and the kernel loop is exited at the br.ctop. If the trip count is odd, the first two epilog stages are executed and then the br.cexit branch is taken. Because the target of the br.cexit branch is the next sequential bundle (L4), a third epilog stage is executed before the kernel loop is exited at the br.ctop.
Page 212
This loop maintains five independent sums in registers f33-f37. The fma instruction in iteration X produces a result that is used by the fma instruction in iteration X+5. Iterations X through X+4 are independent, allowing an II of one to be achieved. The code for a pipelined version of the loop, assuming two memory ports and a nine-cycle latency for a floating-point load, is shown below: lc = 199...
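The idea of five independent running sums can be illustrated in Python. This is a conceptual sketch only: the function name `interleaved_sum` and the `lanes` parameter are hypothetical, and the final combination of the partial sums is an assumption about how such a loop would be finished. Note that regrouping floating-point additions in this way can change rounding compared with a single serial sum.

```python
def interleaved_sum(values, lanes=5):
    """Accumulate into `lanes` independent partial sums.

    Iteration i only touches partial[i % lanes], so it depends on
    iteration i - lanes and on nothing in between.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum(partial), partial


if __name__ == "__main__":
    total, partial = interleaved_sum([float(i) for i in range(1, 201)])
    print(total, partial)
```

Because consecutive iterations write different accumulators, a new iteration can start every cycle even when a single accumulate takes several cycles to complete.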
Page 213
Note that, in the code above, the ld4 and the add instructions in stage 2 have been reordered. Register rotation has been used to eliminate the WAR register dependency from the add to the ld4. The first two stages are speculative. The code to implement the pipeline is shown below: r36 = [r5] ec = 2...
Page 214
under-utilized during the prolog and epilog phases. Part of the prolog and epilog could be peeled off and merged with the code preceding and following the loop. The following is a pipelined version of that counted loop with an explicit prolog and epilog: lc = 196 ec = 1 prolog:...
Page 215
5.5.9 Redundant Load Elimination in Loops Unrolling of a loop is sometimes necessary to remove copy operations created by loop optimizations. The following is an example of redundant load elimination. In the code below, each iteration loads two values, one of which has already been loaded by the previous source iteration: r8 = r5,4;;...
Floating-point Applications Overview The Itanium floating-point architecture is fully ANSI/IEEE-754 standard compliant and provides performance enhancing features such as the fused multiply accumulate instruction, the large floating-point register file (with static and rotating sections), the extended range register file data representation, the multiple independent floating-point status fields, and the high bandwidth memory access instructions that enable the creation of compact, high performance, floating-point application code.
Page 217
6.2.2 Execution Bandwidth
When sufficient ILP exists and can be exploited, the performance limitation is the availability of the execution resources – or the execution bandwidth of the machine. Consider the dense matrix multiply kernel from the BLAS3 library.
DO 1 i = 1, N
DO 1 j = 1, P
DO 1 k = 1, M
C[i,j] = C[i,j] + A[i,k]*B[k,j]...
Page 218
Floating-point Features in the Intel® Itanium® Architecture
This section highlights architectural features that reduce the impact of the performance limiters described in Section 6.2...
Page 219
Here, three registers are required to hold the operands (f5, f6) and the accumulator (f7). By recognizing the reuse of A[i,k] for different B[k,j] as j is varied, and the reuse of B[k,j] for different A[i,k] as i is varied, the computation can be restructured as follows:
DO 1 i = 1, N, 2
DO 1 j = 1, P, 2
DO 1 k = 1, M...
Page 220
If we suppose the minimum floating-point load latency is 9 clocks, and 2 memory operations can be issued per clock, the above loop has to be unrolled by at least six if there is no register rotation. r8 = r7, 8 (p18) [r7] = f25, 16 // Cycle 17,26...
Page 221
inputs that might be single precision numbers. With the rounding performed at the 64th precision bit (instead of the 24th for single precision) a smaller error is accumulated with each multiply and add. Furthermore, with 17 bits of range (instead of 8 bits for single precision) large positive and negative products can be added to the accumulator without overflow or underflow.
6.3.3 Software Divide/Square Root Sequence
To perform division or square root operations on the Itanium architecture, a software-based sequence of operations is used. The sequence consists of obtaining an initial guess (using the frcpa/frsqrta instruction) and then refining the guess by performing Newton-Raphson iterations until the error is small enough that it does not affect the rounding of the result.
For divide, the first instruction (frcpa) provides an approximation (good to 8 bits) of the reciprocal of f7 and sets the predicate (p6) to 1, if the ratio f6/f7 can be obtained using the prescribed Newton-Raphson iterations. If, however, the ratio f6/f7 is special (finite/0, finite/infinite, etc) the final result of f6/f7 is provided in f8 and the predicate (p6) is cleared.
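The refinement step can be illustrated numerically in Python. This is a sketch of the general Newton-Raphson reciprocal iteration, not the exact instruction sequence: the function name, the step count, and the synthetic 8-bit starting guess are assumptions, and ordinary double arithmetic stands in for the fused multiply-add and extended precision the hardware sequence relies on, so correct rounding of the final quotient is not guaranteed here.

```python
def reciprocal_newton(b: float, y0: float, steps: int) -> float:
    """Refine y0, an approximation of 1/b, by Newton-Raphson iteration.

    Each step computes the error term e = 1 - b*y and then y = y + y*e,
    which roughly squares the relative error.
    """
    y = y0
    for _ in range(steps):
        e = 1.0 - b * y      # how far b*y is from 1
        y = y + y * e        # corrected estimate
    return y


if __name__ == "__main__":
    b = 3.0
    guess = (1.0 / b) * (1.0 + 2.0 ** -9)   # stand-in for an ~8-bit guess
    for n in range(4):
        y = reciprocal_newton(b, guess, n)
        print(n, abs(1.0 - b * y))
```

Starting from about 8 good bits, the error drops to roughly 2^-18 after one step and reaches the limit of double precision after three, which is why only a few iterations are needed.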
6.3.5 Multiple Status Fields The FPSR is divided into one main (architectural) status field and three additional identical status fields. These additional status fields could be used to performance advantage. First, divide and square-root sequences (described in Section 6.3.3) contain operations that might cause intermediate results to overflow/underflow or be inexact even if the final result may not.
The availability of multiple additional status fields can allow a user to maintain multiple computational environments and to dynamically select among them on an operation by operation basis. One such use is in the implementation of interval arithmetic code where each primitive operation is required to be computed in two different rounding modes to determine the interval of the result.
Page 226
Since NaNs are unordered, comparison with NaNs (including LT) will return false. Hence if the above code is implemented as:
      ldf f5 = [r5], 8;;
L1:   ldf f6 = [r5], 8
      fmin f5 = f6, f5
      br.cloop L1 ;;
NaNs in the array (X) will be ignored. If the value in the array X (loaded in f6) is a NaN, the new minimum value (in f5) will remain unchanged, since the NaN will fail the .LT.
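The effect of operand order can be checked with a short Python model. It assumes that fmin selects its first source when the less-than comparison succeeds and its second source otherwise; the function names are hypothetical.

```python
import math

def fmin(f2: float, f3: float) -> float:
    """Model of the selection: f2 if (f2 < f3), otherwise f3."""
    return f2 if f2 < f3 else f3

def array_min(values):
    """Running minimum using fmin(new_element, current_min), as in the loop."""
    current = values[0]
    for x in values[1:]:
        current = fmin(x, current)   # a NaN in x fails the comparison
    return current
```

With the operands swapped, as in `fmin(current, x)`, a NaN element would replace the running minimum instead of being skipped. In this model a NaN in the first element is not skipped either, because it seeds the running minimum.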
architecture provides instructions that allow moving floating-point fields between the integer and floating-point register files. Division of a floating-point number by 2.0 is accomplished as follows:
      getf.exp r5 = f5    // Move S+Exp to int
      add r5 = r5, -1     // Sub 1 from Exp
      setf.exp = r5       // Move S+Exp to FP...
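The same idea can be tried on an IEEE-754 double in Python. This is only an analogy: it uses the 11-bit exponent field of a double rather than the register-format sign and exponent, and the function name is hypothetical.

```python
import struct

def halve_via_exponent(x: float) -> float:
    """Divide a normal double by 2.0 by decrementing its exponent field."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    exponent = (bits >> 52) & 0x7FF
    if exponent <= 1 or exponent == 0x7FF:
        # Zero, denormals, the smallest normals, infinity and NaN cannot be
        # halved by plain arithmetic on the exponent field.
        raise ValueError("only normal numbers with exponent headroom are handled")
    bits -= 1 << 52                      # subtract 1 from the biased exponent
    (result,) = struct.unpack("<d", struct.pack("<Q", bits))
    return result
```

The sign and significand bits are untouched, so the result is exact for every input the function accepts.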
The inner loop consists of two loads (for A and B) and a multiply-add (to accumulate the product on C). The loop would run at the latency of the fma due to the recurrence on C. In order to break the recurrence on C, the loop is typically unrolled and multiple partial accumulators are used.
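A minimal Python sketch of the partial-accumulator transformation follows. The lane count of 4 and the function name are assumptions for illustration; the appropriate unroll factor depends on the fma latency of the implementation.

```python
def dot_unrolled(a, b, lanes: int = 4) -> float:
    """Dot product using `lanes` independent partial sums."""
    partial = [0.0] * lanes
    n = len(a) - len(a) % lanes
    for i in range(0, n, lanes):
        for k in range(lanes):           # unrolled body: no lane depends on another
            partial[k] += a[i + k] * b[i + k]
    for i in range(n, len(a)):           # leftover elements
        partial[0] += a[i] * b[i]
    return sum(partial)                  # final reduction of the partial sums
```

Because each partial sum depends only on its own previous value, consecutive multiply-adds no longer wait on one another. The additions are reassociated, so the rounded result may differ slightly from that of the single-accumulator loop.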
Page 229
support in the Itanium architecture beyond the software-pipelining support described in Chapter 5, “Software Pipelining and Loop Support” that help to overcome some of these performance limiters. Architectural support for speculation, rounding, and precision control are also described. Examples in the chapter include how to implement floating-point division and square root, common scientific computations such as reductions, use of features such as the fma instruction, and various Livermore kernels.
Intel® Itanium® Architecture Software Developer’s Manual, Volume 2: System Architecture, Revision 2.3, May 2010, Document Number: 245318...
Page 232
Part 1: Application Architecture Guide ......2:3 1.1.2 Part 2: Optimization Guide for the Intel® Itanium® Architecture ..2:3 Overview of Volume 2: System Architecture.
Page 242
Interaction of Ordering and Accesses to Sequential Locations ... 2:524
Why a Fence During Context Switches is Required in the Intel® Itanium® Architecture ... 2:526
Spin Lock Code ...
Page 245
Hardware policies returned in cur_policy ... 2:395
Page 246
Architecture Provides a Relaxed Ordering Model ... 2:512
Acquire and Release Semantics Order Intel® Itanium® Memory Operations ... 2:513
Loads May Pass Stores to Different Locations ...
Page 247
Interruption Handler Execution Environment (PSR and RSE.CFLE Settings) ... 2:540
Preserving Intel® Itanium® General and Floating-point Registers ... 2:549
Register State Preservation at Different Points in the OS ...
Page 255
These resources include instructions and registers.
Itanium Architecture – The new ISA with 64-bit instruction capabilities, new performance-enhancing features, and support for the IA-32 instruction set.
IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
Page 256
• Intel® 64 and IA-32 Architectures Software Developer’s Manual – This set of manuals describes the Intel 32-bit architecture. They are available from the Intel Literature Department by calling 1-800-548-4725 and requesting Document Numbers 243190, 243191 and 243192.
Page 257
Date of Revision: August 2005. Description: Allow register fields in CR.LID register to be read-only and CR.LID checking on interruption messages by processors optional. See Vol 2, Part I, Ch 5 “Interruptions” and Section 11.2.2 PALE_RESET Exit State for details. Relaxed checking of reserved and ignored fields in IA-32 application registers in Vol 1 Ch 6 and Vol 2, Part I, Ch 10.
Page 258
Date of Revision: August 2002. Description: Added Predicate Behavior of alloc Instruction Clarification (Section 4.1.2, Part I, Volume 1; Section 2.2, Part I, Volume 3). Added New fc.i Instruction (Section 4.4.6.1, and 4.4.6.2, Part I, Volume 1; Section 4.3.3, 4.4.1, 4.4.5, 4.4.6, 4.4.7, 5.5.2, and 7.1.2, Part I, Volume 2; Section 2.5, 2.5.1, 2.5.2, 2.5.3, and 4.5.2.1, Part II, Volume 2;...
Page 259
Description: Volume 2: Class pr-writers-int clarification (Table A-5). PAL_MC_DRAIN clarification (Section 4.4.6.1). VHPT walk and forward progress change (Section 4.1.1.2). IA-32 IBR/DBR match clarification (Section 7.1.1). ISR figure changes (pp. 8-5, 8-26, 8-33 and 8-36). PAL_CACHE_FLUSH return argument change –...
Page 260
Description: Volume 2: Clarifications regarding “reserved” fields in ITIR (Chapter 3). Instruction and Data translation must be enabled for executing IA-32 instructions (Chapters 3, 4 and 10). FCR/FDR mappings, and clarification to the value of PSR.ri after an RFI (Chapters 3 and 4).
[Figure: boot flow. Recoverable labels: Reset; Platform Test & Initialization; Itanium® architecture-based OS Boot; each stage is annotated with the instruction sets in use (Intel® Itanium® instructions, IA-32 instructions).]
Volume 2, Part 1: Intel® Itanium® System Environment 2:13...
Page 262
• Chapter 7, “Debugging and Performance Monitoring” describes debug and performance monitoring hooks.
• Chapter 8, “Interruption Vector Descriptions” describes interruption handler entry points.
Page 263
Chapter 9 describes IA-32 interruption handler entry points.
• Chapter 10, “Itanium® Architecture-based Operating System Interaction Model with IA-32 Applications” describes how IA-32 applications interact with Itanium architecture-based operating systems.
Page 265
System State and Programming Model
This chapter describes the architectural state visible only to an operating system and defines system state programming models. It covers the functional descriptions of all the system state registers, descriptions of individual fields in each register, and their serialization requirements.
Page 266
serialization requirements. This approach simplifies hardware and allows for more efficient software operations. For example, during a low level context switch where there is no immediate use of loaded system registers, these registers can be loaded without any serialization overhead. To ensure side effects are observed before a dependent instruction is fetched or executed, two serialization operations are provided: instruction serialization and data serialization.
Page 267
The control registers are different from the general registers and other registers. Most control registers require an explicit data serialization between the writing of a control register and the reading of that same control register. (See Table 3-3 on page 2:29 for the serialization requirements of specific control registers.) The Data Serialize (srlz.d) instruction performs explicit data serialization.
Page 268
System State
The architecture provides a rich set of system register resources for process control, interruption handling, protection, debugging, and performance monitoring. This section gives an overview of these resources.
3.3.1 System State Overview
Figure 3-1 shows the set of all defined privileged system register resources. Application state as defined in “Application Register State”...
Page 269
• Region Registers (RR) – Eight 64-bit region registers specify the identifiers and preferred page sizes for multiple virtual address spaces. Refer to “Region Registers (RR)” on page 2:58 for complete information. • Protection Key Registers (PKR) – At least sixteen 64-bit protection key registers contain protection keys and read, write, execute permissions for virtual memory protection domains.
Page 270
Figure 3-1. System Register Model
[Figure: register model diagram. Recoverable labels: Application Register Set (General Registers with NaTs, Floating-point Registers, Branch Registers, Predicates, Instruction Pointer, Current Frame Marker, User Mask, Application Registers including BSPSTORE, RNAT, EFLAG, CFLG, UNAT, FPSR), Banked registers, Performance Monitor Data Registers, Processor Identifiers (cpuid), Advanced Load Address Table...]
Page 271
3.3.2 Processor Status Register (PSR)
The PSR maintains the current execution environment. The PSR is divided into four overlapping sections (see Figure 3-2): user mask bits (PSR{5:0}), system mask bits (PSR{23:0}), the lower half (PSR{31:0}), and the entire PSR (PSR{63:0}). PSR fields are defined in Table 3-2 along with serialization requirements for modification of each...
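The four overlapping sections can be expressed as bit masks. The following Python sketch uses only the bit ranges given above; the constant and function names are hypothetical.

```python
USER_MASK = (1 << 6) - 1      # PSR{5:0}
SYSTEM_MASK = (1 << 24) - 1   # PSR{23:0}
LOWER_HALF = (1 << 32) - 1    # PSR{31:0}
FULL_PSR = (1 << 64) - 1      # PSR{63:0}

# Ordered from the smallest section to the largest.
SECTIONS = [
    ("user mask", USER_MASK),
    ("system mask", SYSTEM_MASK),
    ("lower half", LOWER_HALF),
    ("entire PSR", FULL_PSR),
]

def psr_section(psr: int, mask: int) -> int:
    """Extract the bits of `psr` that belong to one section."""
    return psr & mask

def smallest_section(bit: int) -> str:
    """Name the smallest PSR section that contains the given bit position."""
    if not 0 <= bit <= 63:
        raise ValueError("PSR bit positions run from 0 to 63")
    for name, mask in SECTIONS:
        if (1 << bit) & mask:
            return name
```

Each section is a subset of the next larger one, so a user mask bit is also a system mask bit, a lower-half bit, and a bit of the entire PSR.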
Page 272
Lower (f2 .. f31) floating-point registers written – This bit is set to one when an Intel Itanium instruction completes that uses register f2..f31 as a target register. This bit is sticky and only cleared by an explicit write of the user mask. Interruption state: unchanged. Serialization required: data.
Page 273
Upper (f32 .. f127) floating-point registers written – This bit is set to one when an Intel Itanium instruction completes that uses register f32..f127 as a target register. This bit is sticky and only cleared by an explicit write of the user mask. Interruption state: unchanged. Serialization required: data.
Page 274
Table 3-2. Processor Status Register Fields (Continued)
Columns: Field, Bits, Description, Interruption State, Serialization Required
Disabled Floating-point High register set – When 1, a data read or write access to f32 through f127 results in a Disabled Floating-Point Register fault. When 1, a Disabled FP Register fault is raised on the first IA-32 target instruction following a br.ia or rfi, regardless of whether f32-127 are referenced.
Page 275
PSR.cpl is unchanged by the jmpe and br.ia instructions. PSR.cpl cannot be updated by any IA-32 instructions. Instruction Set – When 0, Intel Itanium instructions are executing. When 1, IA-32 instructions are executing. Written by the rfi and br.ia instructions and the IA-32 jmpe instruction.
Page 276
Table 3-2. Processor Status Register Fields (Continued)
Columns: Field, Bits, Description, Interruption State, Serialization Required
Single Step enable – When 1, a Single Step trap occurs following the successful execution of the first restart instruction in the current bundle. Instruction slots 0, 1, and 2 can be single stepped.
Page 277
a. User mask bits are implicitly serialized if accessed via user mask instructions (sum, rum, and move to User Mask). If modified with system mask instructions (rsm, ssm, and move to PSR.l), software must explicitly serialize to ensure side effects are observed before dependent instructions.
b.
Page 278
Table 3-3. Control Registers (Continued)
Columns: Register, Name, Description, Serialization Required
Interruption Control Registers:
CR16  IPSR  Interruption Processor Status Register  implied
CR17  ISR   Interruption Status Register            implied
CR18        reserved
CR19  IIP   Interruption Instruction Pointer        implied
CR20  IFA   Interruption Faulting Address           implied
CR21  ITIR  Interruption TLB Insertion Register     implied
CR22  IIPA...
Page 279
All unaligned Intel Itanium semaphore references generate an Unaligned Data Reference fault. All aligned Intel Itanium semaphore references made to memory that is neither write-back cacheable nor a NaTPage result in an Unsupported Data Reference fault.
Page 280
Table 3-5. Default Control Register Fields (Continued)
Columns: Field, Description, Serialization Required
Defer Key Miss faults only – When 1, and a Key Miss fault is deferred, lower priority Access Bit, Access Rights or Debug faults may still be delivered. A Key Miss fault, deferred or not, precludes concurrent Key Permission faults. Serialization required: data.
Page 281
A sequence of reads of the ITC is guaranteed to return ever-increasing values (except for the case of the counter wrapping back to 0) corresponding to the program order of the reads. Applications can directly sample the ITC for time-based calculations. A 64-bit overflow condition can occur without notification.
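Because the counter can wrap to 0 without notification, time-based calculations are usually done modulo the counter width. A minimal Python sketch, assuming a 64-bit counter and at most one wrap between the two samples (the function name is hypothetical):

```python
MASK64 = (1 << 64) - 1

def elapsed_ticks(start: int, end: int) -> int:
    """Ticks between two counter samples, tolerating one wrap through zero."""
    return (end - start) & MASK64
```

For example, a sample taken 5 ticks before the wrap and a second sample 5 ticks after it give an elapsed count of 10.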
Page 282
A sequence of reads of the RUC is guaranteed to return ever-increasing values (except for the case of the counter wrapping back to 0) corresponding to the program order of the reads. Applications can directly sample the RUC for active-running-time calculations.
Page 283
3.3.4.5 Interruption Vector Address (IVA – CR2)
The IVA specifies the location of the interruption vector table in the virtual address space, or in the physical address space if PSR.it is 0 (see Figure 3-7). The vector table is 32K bytes in size and is 32K-byte aligned. The lower 15 bits of the IVA are ignored when written; reads return zeros.
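The alignment rule can be modeled in a few lines of Python. The function name is hypothetical; the sketch covers only the masking behavior stated above.

```python
IVT_SIZE = 32 * 1024              # 32K-byte table, 32K-byte aligned
IVA_IGNORED_BITS = IVT_SIZE - 1   # the lower 15 bits
MASK64 = (1 << 64) - 1

def iva_readback(written: int) -> int:
    """Value read back after a write: the lower 15 bits return zeros."""
    return written & ~IVA_IGNORED_BITS & MASK64
```

Writing 0x12345, for instance, reads back as 0x10000, which is the base of the enclosing 32K-byte aligned block.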
Page 284
3.3.5 Interruption Control Registers Registers CR16 - CR27 record information at the time of an interruption (including from the IA-32 instruction set) and are used by handlers to process the interruption. The interruption control registers can only be read or written while PSR.ic is 0; otherwise, an Illegal Operation fault is raised.
Page 285
(the processor was performing a data memory access to the IDT, GDT, LDT or TSS segments) or an IA-32 data memory access at a privilege level of zero. This bit is always 0 for interruptions taken while executing Intel Itanium instructions.
Page 286
Figure 3-10, all 64-bits of the IIP must be implemented regardless of the size of the physical and virtual address space supported by the processor model (see “Unimplemented Address Bits” on page 2:73). IIP also receives byte-aligned IA-32 instruction pointers. The IIP, IPSR and IFS are used to restore processor state on a Return From Interruption instruction (rfi).
Page 287
faulting instruction and IIP points to the first byte of the faulting instruction, or (2) for faults on the second page, IFA contains the bundle address of the second virtual page and IIP points to the first byte of the faulting IA-32 instruction. The IFA also specifies a translation’s virtual address when a translation entry is inserted into the instruction or data TLB.
Page 288
3.3.5.6 Interruption Instruction Previous Address (IIPA – CR22) For Itanium instructions, IIPA records the last successfully executed instruction bundle address. For IA-32 instructions, IIPA records the byte granular virtual instruction address zero extended to 64-bits of the faulting or trapping IA-32 instruction. In the case of a fault, IIPA does not report the address of the last successfully executed IA-32 instruction, but rather the address of the faulting IA-32 instruction.
Page 289
3.3.5.7 Interruption Function State (IFS – CR23) The IFS register is used to reload the current register stack frame (CFM) on a Return From Interruption (rfi). If the IFS is accessed while PSR.ic is 1, an Illegal Operation fault is raised. The IFS can only be accessed at privilege level 0; otherwise, a Privileged Operation fault is raised.
Page 290
3.3.5.10 Interruption Instruction Bundle Registers (IIB0-1 – CR26, 27) On an interruption and if PSR.ic is 1, the IIB registers receive the 16-byte instruction bundle corresponding to the interruption. The bundle reported in the IIB registers is the bundle exactly as it was fetched for execution of the instruction which raised the interruption.
Page 291
• An interruption selects bank 0,
• rfi switches to the bank specified by IPSR.bn, or
• bsw switches to the specified bank.
On an interruption or bank switch, the processor ensures all prior register accesses (reads and writes) are performed to the prior register bank. Data values in banked registers are preserved across bank switches and both banks maintain NaT values when loaded from general registers.
Page 292
Processor Virtualization
Processors in the Itanium Processor Family may optionally implement a mechanism to support processor virtualization. This includes an additional PSR.vm bit (see Section 3.3.2, “Processor Status Register (PSR)”), which, when 1, causes certain instructions to take a Virtualization fault (see Section 5.6, “Interruption Priorities”...
Page 293
Addressing and Protection
This chapter defines operating system resources to translate 64-bit virtual addresses into physical addresses, 32-bit virtual addressing, virtual aliasing, physical addressing, memory ordering and properties of physical memory. Register state defined to support virtual memory management is defined in Chapter 3, while Chapter 5...
Page 294
Figure 4-1. Virtual Address Spaces
[Figure: recoverable labels: Virtual Address; 8 Virtual Regions; Bytes Per Region; 4K to 256M Pages; Virtual Address Spaces.]
By assigning sequential region identifiers, regions can be coalesced to produce larger 62-, 63- or 64-bit spaces. For example, an operating system could implement a 62-bit region for process private data, a 62-bit region for I/O, and a 63-bit region for globally shared data.
Page 295
Virtual addressing is enabled for instruction references when PSR.it is 1, for data references when PSR.dt is 1, and for register stack accesses when PSR.rt is 1.
Figure 4-2. Conceptual Virtual Address Translation for References
[Figure: recoverable labels: Virtual Address with bit positions 63, 61, 60; Virtual Region Number (VRN); Virtual Page Number (VPN); Region Registers; Region ID...]
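The address split suggested by Figure 4-2 can be sketched in Python. The sketch assumes that the VRN occupies bits 63:61, that the remaining 61 bits are divided into a page number and a page offset by the page size, and that the page size is passed as a power-of-two exponent; the function and parameter names are hypothetical.

```python
def split_virtual_address(va: int, page_size_log2: int):
    """Return (vrn, vpn, page_offset) for a 64-bit virtual address."""
    vrn = (va >> 61) & 0x7                        # selects one of 8 region registers
    region_offset = va & ((1 << 61) - 1)          # bits 60:0
    vpn = region_offset >> page_size_log2         # page number within the region
    page_offset = va & ((1 << page_size_log2) - 1)
    return vrn, vpn, page_offset
```

The VRN indexes a region register, whose region identifier is combined with the VPN when the TLB is searched; the page offset passes through translation unchanged.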
Page 296
The TLB is a local processor resource; a local translation insertion or a local processor purge does not affect other processors’ TLBs. Global TLB purges are provided to purge translations from all processors within a TLB coherence domain in a multiprocessor system.
Page 297
4.1.1.2 Translation Cache (TC) The Translation Cache (TC) is an implementation-specific structure defined to hold the large working set of dynamic translations for memory references (including IA-32). Please see the processor-specific documentation for further information on Itanium processor TC implementation details. The processor directly controls the replacement policy of all TC entries.
Page 298
inserted TC entry may be occasionally removed before this point, and software must be prepared to re-insert the TC entry on a subsequent fault. For example, eager or mandatory RSE activity, speculative VHPT walks, or other interruptions of the restart instruction may displace the software-inserted TC entry, but when software later re-inserts the same TC entry, the processor must eventually complete the restart instruction to ensure forward progress, even if that restart instruction takes other faults which must be handled before it can complete.
Page 299
4.1.1.4 Purge Behavior of TLB Inserts and Purges Translations contained in the translation caches (TC) and translation registers (TR) are maintained in a consistent state by ensuring that TLB insertions remove existing overlapping entries before new TR or TC entries are installed. Similarly, TLB purges that partially or fully overlap with existing translations may remove all overlapping entries.
Page 300
Note: Please refer to Table 4-1 for footnotes in Table 4-2. Table 4-1. Purge Behavior of TLB Inserts and Purges Case Insert? Purge? Machine Check? it[cr].[id] overlaps [ID]TC Must Must Must not it[cr].[id] overlaps [DI]TC Must Must not it[cr].[id] overlaps [ID]TR Must it[cr].[id] overlaps [DI]TR Must...
Page 301
Table 4-2. Purge behavior of VHPT Inserts VRN bits used for TLB searching on VHPT insert VRN bits not used for TLB searching on VHPT insert VRN Match No VRN Match Case Machine Machine Machine Insert? Purge? Insert? Purge? Insert? Purge? Check? Check?
Page 302
• The GR[r] value is checked when a TLB insert instruction is executed, and if reserved fields or reserved encodings are used, a Reserved Register/Field fault is raised on the TLB insert instruction. If GR[r]{0} is zero (not-present Translation Insertion Format), the rest of GR[r] is ignored. •...
Page 303
Accessed bit on a reference. GR[r]{6} Dirty Bit – When 0 and PSR.da is 0, Intel Itanium store or semaphore references to the page cause a Data Dirty Bit fault. When 0, IA-32 store or semaphore references to the page cause a Data Dirty Bit fault. The processor does not update the Dirty bit on a store or semaphore reference.
Page 304
Figure 4-6. Translation Insertion Format – Not Present
[Figure: bit-field layout of GR[r], ITIR and RR[vrn] for the not-present format; recoverable field markings: rv/ci, rv, ig.]
4.1.1.6 Page Access Rights
Page granular access controls use 4 levels of privilege. Privilege level 0 is the most privileged and has access to all privileged instructions; privilege level 3 is least privileged.
Page 305
Table 4-4. Page Access Rights (Continued) Privilege Level TLB.ar TLB.pl Description read, write, execute / read, write – – – – – – exec, promote / read, execute a. RSC.pl, for RSE fills and spills; PSR.cpl for all other accesses. b.
Page 306
Table 4-5. Architected Page Sizes
[Table: recoverable labels: Page Sizes (only 256K and 256M survive), Insertable, Purgeable.]
Page sizes are encoded in translation entries and region registers as a 6-bit encoded page size field. Each field specifies a mapping size of 2^N bytes, where N is the field value; thus a value of 12 represents a 4K-byte page.
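The encoding is a base-2 logarithm, as the example of 12 for a 4K-byte page shows. A short Python sketch with hypothetical function names follows; it converts in both directions but does not check whether a size is one of the architected page sizes.

```python
def page_bytes(field: int) -> int:
    """Mapping size in bytes for a 6-bit encoded page size field."""
    if not 0 <= field < 64:
        raise ValueError("the page size field is 6 bits wide")
    return 1 << field

def page_size_field(n_bytes: int) -> int:
    """Encoded field value for a power-of-two page size in bytes."""
    if n_bytes <= 0 or n_bytes & (n_bytes - 1):
        raise ValueError("page sizes are powers of two")
    return n_bytes.bit_length() - 1
```

For instance, a 256M-byte page is encoded as 28.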
Page 307
Table 4-6. Region Register Fields (Continued)
Columns: Field, Bits, Description
Preferred page Size – Selects the virtual address bits used in hash functions for set-associative TLBs or the VHPT. Encoded as 2^N bytes, where N is the field value. The processor may make significant performance optimizations for the specified preferred page size for the region.
Page 308
Processor models have at least 16 protection key registers, and at least 18-bits of protection key. Some processor models may implement additional protection key registers and protection key bits. Unimplemented bits and registers are reserved. Key registers have at least as many implemented key bits as region registers have rid bits. Additional implemented bits must be contiguous and start at bit 18.
Page 309
Table 4-8. Translation Instructions (Continued) Instr. Serialization Mnemonic Description Operation Type Requirement Insert data DTC = GR[r ], IFA, ITIR data itc.d r translation cache Insert instruction ITR[GR[r ]] = GR[r ], IFA, ITIR inst itr.i itr[r ] = r translation register Insert data...
Page 310
Figure 4-9. Virtual Hash Page Table (VHPT)
[Figure: recoverable labels: Virtual Address, Region Registers, Hashing Function, PTA.base, PTA.size, VHPT, Search, Install, Optional Collision Chain, Optional Operating System Page Tables.]
The processor does not manage the VHPT or perform any writes into the table. Software is responsible for insertion of entries into the VHPT (including replacement algorithms), dirty/access bit updates, invalidation due to purges and coherency in a multiprocessor system.
Page 311
fault is raised. If the region-based short-format VHPT entry contains no reserved bits or encodings, it is installed into the TLB, and the processor again attempts to translate the failed instruction or data reference. If the long-format VHPT entry’s tag specifies the correct region identifier and virtual address, and the entry contains no reserved bits or encodings, it is installed into the TLB, and the processor again attempts to translate the failed instruction or data reference.
Page 312
• Protection Key – specified by the accessed region identifier value (RR[VA{63:61}].rid). As a result, all implementations must ensure that the number of implemented key bits is greater than or equal to the number of implemented region identifier bits. If a translation is marked as not present, ignored fields are usable by software as noted in Figure 4-11.
Page 313
Figure 4-13. VHPT Not-present Long Format offset 32 31 2 1 0 For multiprocessor systems, atomic updates of long-format VHPT entries may be ensured by software as follows: • Before making multiple non-atomic updates to a VHPT entry in memory, software is required to set its ti bit to one.
Page 314
in which the VHPT is enabled, the operating system is required to maintain a per-region linear page table. As defined in Figure 4-14, the VHPT walker uses the virtual address, the region’s preferred page size, and the PTA.size field to compute a linear index into the short-format VHPT.
Page 315
the tag (ti bit) is zero for all valid tags. The hash index and tag together must uniquely identify a translation. The processor must ensure that the indices into the hashed table, the region’s preferred page size, and the tag specified in an indexed entry can be used in a reverse hash function to uniquely regenerate the region identifier and virtual address used to generate the index and tag.
Page 316
operating systems must ensure that the VHPT is aligned on the natural boundary of the structure; otherwise, processor operation is undefined. For example, a 64K-byte table must be aligned on a 64K-byte boundary. VHPT walker references to the VHPT are performed at privilege level 0, regardless of the state of PSR.cpl.
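The natural-alignment requirement amounts to a single mask test. A minimal Python sketch, assuming the table size in bytes is a power of two (the function name is hypothetical):

```python
def is_naturally_aligned(base: int, size_bytes: int) -> bool:
    """True if `base` is a multiple of the power-of-two table size."""
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("table size must be a power of two")
    return base & (size_bytes - 1) == 0
```

A 64K-byte table at 0x30000 passes the check, while the same table at 0x18000 does not.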
Page 317
4.1.8 Translation Searching The general sequence of searching the TLB and VHPT is shown in Figure 4-16. On a failed TLB search, if the VHPT walker is disabled for the referenced region an Alternate Instruction/Data TLB Miss fault is raised. If the VHPT walker is enabled for the referenced region, the VHPT is accessed to locate the missing translation.
Page 318
Figure 4-16. TLB/VHPT Search
[Figure: flowchart of the instruction and data search paths. Recoverable labels: Virtual Address; Implemented VA?; Search TLB (Found / Not Found); PSR.ic; VHPT Walker Enabled; Unimplemented Data Address fault; Data Nested TLB fault; Alternate Instruction TLB Miss fault; Alternate Data TLB Miss fault...]
Page 319
Table 4-10. TLB and VHPT Search Faults (Continued) Fault Description Instruction/Data TLB Miss Raised when the VHPT walker is enabled, but the processor: • Cannot locate the required VHPT entry, or • The processor aborts the VHPT search for implementation-specific reasons, or •...
Page 320
In the sign-extension model, software ensures that the upper 32 bits of a virtual address are always equal to bit 31. Address computations use the add, shladd, and sxt instructions. This model splits the 32-bit address space into two halves that are spread into 2^31 bytes each of virtual regions 0 and 7 within the 64-bit virtual address space.
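The sign-extension model can be checked with a small Python sketch. The function names are hypothetical, and the region number is taken from bits 63:61 of the resulting 64-bit address.

```python
def sign_extend_32(addr32: int) -> int:
    """Copy bit 31 of a 32-bit address into bits 63:32."""
    addr32 &= 0xFFFF_FFFF
    if addr32 & 0x8000_0000:
        return addr32 | 0xFFFF_FFFF_0000_0000
    return addr32

def region_of(va: int) -> int:
    """Virtual region number of a 64-bit address (bits 63:61)."""
    return (va >> 61) & 0x7
```

Addresses below 0x8000_0000 stay at the bottom of region 0, and addresses from 0x8000_0000 upward land at the top of region 7.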
Page 321
Physical Addressing
Objects in memory and I/O occupy a common 63-bit physical address space that is accessed using byte addresses. Accesses to physical memory and I/O may be performed via virtual addresses mapped to the 63-bit physical address space or by direct physical addressing.
Page 322
significant implemented physical address bit. In a processor that implements all physical address bits, IMPL_PA_MSB is 62. Please see the processor-specific documentation for further information on the number of physical address bits implemented on the Itanium processor. If unimplemented physical address bits are set by software, an Unimplemented Data Address fault is raised during the TLB insert instructions (itc, itr).
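The check for unimplemented physical address bits can be sketched as follows. This is a model under stated assumptions: only the 63 address bits are examined, the value 49 used in the usage note is a hypothetical IMPL_PA_MSB, and the function name is not taken from the manual.

```python
def has_unimplemented_pa_bits(pa: int, impl_pa_msb: int) -> bool:
    """True if any address bit above impl_pa_msb (up to bit 62) is set."""
    if not 0 <= impl_pa_msb <= 62:
        raise ValueError("IMPL_PA_MSB is at most 62")
    implemented = (1 << (impl_pa_msb + 1)) - 1
    address_bits = pa & ((1 << 63) - 1)    # the 63-bit physical address space
    return (address_bits & ~implemented) != 0
```

With IMPL_PA_MSB equal to 49, an address with bit 50 set is rejected, while an address with only bit 49 set is accepted. When IMPL_PA_MSB is 62, no 63-bit address is rejected.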
Page 323
4.3.3 Instruction Behavior with Unimplemented Addresses The use of an unimplemented address affects instruction execution as described in the bullet list below. If instruction address translation is enabled, an “unimplemented address” refers to an unimplemented virtual address. If instruction address translation is disabled, an “unimplemented address”...
Page 324
Table 4-11. Virtual Addressing Memory Attribute Encodings Coherent with Attribute Mnemonic ma Cacheability Write Policy Speculation Respect to Write Back Cacheable Write back WB, WBL Non-sequential & Write speculative Coalescing Not MP coherent Coalescing Uncacheable Uncacheable Sequential & Non-coalescing UC, UCE Uncacheable non-speculative Exported...
Page 325
Table 4-12. Physical Addressing Memory Attribute Encodings Coherent with Bit{63} Mnemonic Cacheability Write Policy Speculation respect to Cacheable Write Back Non-sequential & WBL, WB limited speculation Uncached Non-coalescing Sequential & UC, UCE non-speculative a. Coherency here refers to multiprocessor coherence on normal, side-effect free memory. “Speculation Attributes”...
Page 326
maintain coherency between processor local instruction and data caches for IA-32 code. Instruction caches are also not required to be coherent with multiprocessor Itanium instruction set originated memory references. Instruction caches are required to be coherent with multiprocessor IA-32 instruction set originated memory references. The processor must ensure that transactions from other I/O agents (such as DMA) are physically coherent with the instruction and data cache.
Page 327
become flushed and made visible prior to itself becoming visible. Even though IA-32 stores and loads are ordered, the write-coalesced data is not flushed unless the IA-32 stores or loads are to uncached memory types. The Flush Cache (fc, fc.i) instruction flushes all write-coalesced data whose address is within at least 32 bytes of the 32-byte aligned address specified by the Flush Cache (fc, fc.i) instruction, forcing the data to become visible.
Page 328
Prefetches are enabled if a speculative translation exists. Prefetches are asynchronous data and instruction memory accesses that appear logically to initiate and finish between some pair of instructions. This access may not be visible to subsequent flush cache (fc, fc.i) and/or TLB purge instructions. This behavior is implementation-dependent.
Page 329
a. Speculative or speculative advanced loads that cause deferred exceptions result in failed speculation. The processor aborts the reference. If the target of the load is a GR, the processor sets the register’s NaT bit to one. If the target of the load is an FR, the processor sets the target FR to NaTVal. The processor performs all other side-effects (such as post-increment).
Page 330
• It takes an External interrupt, but if it had not taken an External interrupt, it would have met one of the above qualifications (execute without fault, take an Unaligned Data Reference fault, or take a Data Debug fault) Data-speculative loads are treated the same as normal loads, and if an in-order execution of the program requires the execution of a data speculative load, it constitutes a verified reference.
Page 331
Table 4-15. Ordering Semantics and Instructions
Columns: Ordering Semantics, Description, Orderable Intel® Itanium® Instructions
Unordered instructions may become visible in any order. Instructions: ld, ld.s, ld.a, ld.sa, ld.fill, ldf, ldf.s, ldf.sa, ldf.fill, ldfp, ldfp.s, ldfp.sa,...
Page 332
Inter-Processor Interrupt Messages (8-byte stores to a Processor Interrupt Block address, through a UC memory attribute) are exceptions to the sequential semantics. IPIs are not ordered with respect to other IPIs directed at the same processor. Further, fence operations do not enforce ordering between two IPIs. See Section 5.8.4.2, “Interrupt and IPI Ordering”...
Page 333
accesses of different sizes but with overlapping memory references appear to complete non-atomically. To ensure that a memory write is globally observed prior to a memory read, software must place an explicit fence operation between the two operations. Aligned st.rel and semaphore operations from multiple processors to cacheable write-back memory become visible to all observers in a single total order (i.e., in a particular interleaving;...
Page 334
    ld x = [b]
    cmp.eq p1 = x, ‘new’
    (p1) br target
target:
    ld y = [a]
if the second processor observes the store to [b], it will also observe the store to [a]. The flush cache (fc, fc.i) instruction follows data dependency ordering. fc and fc.i are ordered only with respect to previous and subsequent load, store, or semaphore instructions to the same line, regardless of the specified memory attribute.
Page 335
Page Consumption fault. cmpxchg and xchg accesses to pages with other memory attributes cause an Unsupported Data Reference fault. • fetchadd: The fetchadd instruction can be executed successfully only if the access is to a cacheable page with write-back write policy or to a UCE page. fetchadd accesses to NaTPages cause a Data NaT Page Consumption fault.
Page 336
undefined behavior; when changing an existing page from speculative to non-speculative (or vice-versa), software should ensure that any ALAT entries corresponding to that page are invalidated. Limited speculation pages behave like non-speculative pages with respect to speculative advanced loads, and behave like speculative pages with respect to all other advanced and/or check loads.
Page 337
3. mf ;; // Ensure visibility of ptc.ga to local data stream
   srlz.i ;; // Ensure visibility of ptc.ga to local instruction stream
After step 3, no processor in the coherence domain will initiate new memory references or prefetches to the old translation. Note, however, that memory references or prefetches initiated to the old translation prior to step 2 may still be in progress after step 3.
Page 338
9. Call PAL_MC_DRAIN
10. Using the IPI mechanism defined in “Inter-processor Interrupt Messages” on page 2:128 to reach all processors in the coherence domain, perform step 9 above on all processors in the coherence domain, and wait for all PAL_MC_DRAIN calls to complete on all processors in the coherence domain before continuing.
Page 339
// Ensure cache flushes are also seen by processors' instruction fetch
sync.i ;;
After step 3, all flush cache instructions initiated in step 3 are visible to all processors in the coherence domain, i.e., no processor in the coherence domain will respond with a cache line hit on a memory reference to an address belonging to page “X.”...
Page 340
3. Execute:
   mf ;;
   srlz.i ;;
   (This ensures visibility of ptr.d, ptr.i, or ptc.ga to both the data and instruction streams, so that no new prefetches will be done to the old translations.)
4. Call PAL_PREFETCH_VISIBILITY with the input argument trans_type equal to one to indicate that the transition is for all memory attributes.
Page 341
8. If PAL_CACHE_FLUSH is used to flush caches, it must also be called on all processors in the coherency domain. In any case, PAL_MC_DRAIN must be called on all processors. Using the IPI mechanism defined in Section 5.8.4.1, “Inter-processor Interrupt Messages” on page 2:128 to reach all processors in the coherence domain, perform step 6.a, if necessary, and step 7 above in that order on all processors in the coherence domain, and wait for all PAL_MC_DRAIN...
Page 342
boundaries respectively to avoid generation of an Unaligned Data Reference fault. When PSR.ac is 1, any IA-32 data memory reference that is not aligned on a boundary the size of the operand results in an IA_32_Exception(AlignmentCheck) fault. Note: 10-byte and floating-point load double pair datum alignment is 16-bytes. The alignment of long format 32-byte VHPT references is always 32-bytes.
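The alignment rule above can be illustrated with a small check. This is a minimal sketch rather than processor behavior: the function name and its parameters are hypothetical, and it models only the "boundary the size of the operand" rule plus the 16-byte boundary noted for 10-byte operands.

```python
def is_aligned_for_operand(address: int, operand_size: int) -> bool:
    """Return True if `address` sits on the boundary required for the operand.

    Assumption for illustration: the required boundary equals the operand
    size, except that 10-byte operands are checked against 16 bytes, as the
    note in the text states.
    """
    if operand_size <= 0:
        raise ValueError("operand_size must be positive")
    boundary = 16 if operand_size == 10 else operand_size
    return address % boundary == 0
```

With PSR.ac set to 1, a reference for which this check returns False would be the case the text describes as raising an alignment fault.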
Page 343
Interruptions
Interruptions are events that occur during instruction processing, causing control flow to be passed to an interruption handling routine. In the process, certain processor state is saved automatically by the processor. Upon completion of interruption processing, a return from interruption (rfi) is executed which restores the saved processor state.
Page 344
Non-Maskable Interrupts are used to request critical operating system services. NMIs are assigned external interrupt vector number 2.
• External Controller Interrupts (ExtINT)
External Controller Interrupts are used to service Intel 8259A-compatible external interrupt controllers. ExtINTs are assigned locally within the processor to external interrupt vector number 0.
Page 345
and all previous instructions are completed. Subsequent instructions have no effect on machine state. Traps are IVA-based interruptions. Figure 5-1 summarizes the above classification.
Figure 5-1. Interruption Classification (figure labels: Aborts, Interrupts, Faults, Traps; INIT, RESET, (NMI, ExtINT, ...); PAL-based Interruptions, IVA-based Interruptions)
Unless otherwise indicated, the term “interruptions”...
Page 346
Upon an interruption, asynchronous events such as external interrupt delivery are disabled automatically by hardware to allow software to either handle the interruption immediately or to safely unload the interruption resources and save them to memory. Software will either deal with the cause of the interruption and rfi back to the point of the interruption, or it will establish a new environment and spill processor state to memory to prepare for a call to higher-level code.
Page 347
4. For Itanium architecture-based code, the processor checks for a valid register stack frame. • If incomplete and RSE Current Frame Load Enable (RSE.CFLE) is set, then perform a mandatory RSE load and start again at step one. The mandatory load operation may fault.
Page 348
breakpoint faults. The IA-32 effective instruction address (EIP) is converted into a 64-bit virtual linear address IP and IA-32 defined code segmentation and code fetch faults are checked and may result in a fault. 7. When PSR.is is 0, the bundle is fetched using the IP. When PSR.is is 1, an IA-32 instruction is fetched using IP.
Page 349
• If more than one trap is triggered (such as Unimplemented Instruction Address trap, Lower-Privilege Transfer trap, and Single Step trap) the highest priority trap is taken. The ISR.code contains a bit vector with one bit set for each trap triggered.
Page 350
branch-related traps, IIP is written with the target of the branch; for all other traps, IIP is written with the address of the bundle or IA-32 instruction containing the next sequential instruction. • IIPA receives the IP of the last successfully executed Itanium instruction. For IA-32 instructions, IIPA receives the IP of the faulting or trapping IA-32 instruction.
Page 351
registers, overlapping GR16 to GR31. Which set of physical registers are accessed through GR16 to GR31 is determined by the PSR.bn bit. On an interruption this bit is forced to zero allowing access to the alternate set of 16 registers which can be used as scratch space or to hold predetermined values.
Page 352
These non-access Itanium instructions can cause interruptions: fc, fc.i, lfetch.fault, probe, probe.fault, tpa, and tak. (tak can cause interruptions only for non-TLB reasons.) ISR.code will be set to indicate which non-access instruction caused the interruption. See Table 5-1 for ISR field settings for non-access instructions. Table 5-1.
Page 353
5.5.5 Deferral of Speculative Load Faults
Speculative and speculative advanced loads can defer fault handling by suppressing the speculative memory reference, and by setting the deferred exception indicator (NaT bit or NaTVal) of the load target register. Other effects of the instruction (such as post-increment) are performed.
Page 354
Aborts, external interrupts, RSE or instruction-fetch-related faults that happen to occur on a speculative load are always raised (since they are not related to the speculative load instruction). Illegal Operation faults and Disabled Floating-point Register faults that occur on a speculative load are always raised. Processing of exception conditions for speculative and speculative advanced loads is done in three stages: qualification, deferral and prioritization.
Page 355
Deferral is controlled by PSR.ed, PSR.it, PSR.ic, the speculative deferral control bits in the DCR, the exception deferral bit of the code page’s instruction TLB entry (ITLB.ed), and the memory attribute of the referenced data page. The speculative load and speculative advanced load exception deferral conditions are as follows: •...
Page 356
exception condition which is neither precluded nor deferred. Prioritization of non-deferred speculative load faults follows the same interruption priorities as non-speculative instruction faults (Table 5-6 on page 2:109). However, deferred speculative load faults do not take part in the prioritization. As a result, depending on DCR settings, a lower priority fault may be taken, even if a higher priority exception condition exists, but is deferred.
Page 358
Vector Name Class Disabled FP-Register vector Disabled Floating-point Register fault IA-32, General Exception vector Disabled Instruction Set Transition fault Intel Itanium IA-32 Exception vector (DNA) IA-32 Device Not Available fault IA-32 IA-32 Exception vector (FPError) IA-32 FP Error fault IA-32,...
Page 359
Table 5-6. Interruption Priorities (Continued) IA-32 Type Instr. Set Interruption Name Vector Name Class IA-32 Intercept vector (SystemFlag) IA-32 System Flag Intercept trap IA-32 Intercept vector (Gate) IA-32 Gate Intercept trap IA-32 Exception vector (Overflow) IA-32 INTO trap IA-32 Exception vector (Break) IA-32 Breakpoint (INT 3) trap IA-32 IA-32 Interrupt vector (Vector#)
Page 360
greater than the page boundary, any Instruction TLB faults on the second page have higher priority than the IA-32 Code Fetch fault.
Class B: Faults from decoding an instruction. The priority of IA-32 Instruction Length, IA-32 Invalid Opcode, IA-32 Instruction Intercept, Disabled Floating Point Register, Disabled Instruction Set Transition, and Device Not Available faults is model specific.
Page 361
IVA-based Interruption Vectors
Table 5-7 contains the processor’s interruption vector table (IVT). The base of the IVT is held in the IVA control register. The size of the IVT is 32KB. The first 20 vectors are designed to provide more code space by allowing 64 bundles per vector (16 bytes per bundle) for performance-critical interruption handlers.
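The sizes quoted above imply a simple offset calculation for the first 20 vectors. The sketch below works through that arithmetic. The constant and function names are hypothetical, and it covers only the 64-bundle vectors described in this excerpt.

```python
BUNDLE_BYTES = 16            # bytes per bundle, from the text
LARGE_VECTOR_BUNDLES = 64    # bundles per vector for the first 20 vectors
LARGE_VECTOR_COUNT = 20
IVT_BYTES = 32 * 1024        # total IVT size, from the text

LARGE_VECTOR_BYTES = LARGE_VECTOR_BUNDLES * BUNDLE_BYTES  # 0x400 bytes each


def large_vector_offset(index: int) -> int:
    """Byte offset from the IVT base of one of the first 20 vectors."""
    if not 0 <= index < LARGE_VECTOR_COUNT:
        raise ValueError("only the first 20 vectors use 64 bundles")
    return index * LARGE_VECTOR_BYTES


def handler_address(ivt_base: int, index: int) -> int:
    """Address of a vector's first bundle, given the IVT base held in IVA."""
    return ivt_base + large_vector_offset(index)
```

Under these assumptions the offsets advance in steps of 0x400, which matches vector offsets such as 0x0400, 0x0800, and 0x2c00 quoted in the vector descriptions. The 20 large vectors occupy 20KB, leaving 12KB of the 32KB table for the remaining vectors.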
Page 363
(LINT, INIT, PMI), and are always directed to the local processor. The LINT pins can be connected directly to an Intel 8259A-compatible external interrupt controller. The LINT pins are programmable to be either edge-sensitive or level-sensitive, and for the kind of interrupt that gets generated. If programmed to generate external interrupts, the vector number is a programmed constant per LINT pin.
Page 364
• Internal processor interrupts – such as interval timer, performance monitoring, and corrected machine checks. These are always directed to the local processor. A unique vector number can be programmed for each source.
• Other processors – A processor can interrupt any individual processor, including...
Page 365
• The priority of interrupts is defined in Table 5-8. Entry A is higher priority than interrupt B, if entry A appears at a higher location in the table than entry B. Interrupt priority is used to select interrupts that require urgent service over less urgent interrupt requests.
Page 366
0 - 255. Vector numbers 1 and 3 through 14 are reserved for future use. Vector number 0 (ExtINT) is used to service Intel 8259A-compatible external interrupt controllers. Vector number 2 is used for the Non-Maskable Interrupt (NMI). The remaining 240 external interrupt vector numbers (16 through 255) are available for general operating system use.
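The vector number assignments above can be summarized as a lookup. This is a sketch with hypothetical names. It encodes only what this excerpt states and deliberately leaves vector 15 unclassified because the excerpt does not describe it.

```python
def classify_external_vector(vector: int) -> str:
    """Classify an external interrupt vector number per the text above."""
    if not 0 <= vector <= 255:
        raise ValueError("vector numbers range from 0 to 255")
    if vector == 0:
        return "ExtINT"
    if vector == 2:
        return "NMI"
    if vector == 1 or 3 <= vector <= 14:
        return "reserved"
    if vector >= 16:
        return "available"
    # Vector 15 is not covered by this excerpt, so it is left unclassified.
    return "not described in this excerpt"
```

Counting the vectors classified as "available" gives 240, which agrees with the figure quoted in the text.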
Page 367
Table 5-8. Interrupt Priorities, Enabling, and Masking Interrupt Priority Vector Interrupt Unmasked Priority Interrupt Delivery Class Number Condition Enabled Highest INIT if PSR.mc is 0 Always 0..3 if PSR.ic is 1 Always 2 (NMI) if PSR.i is 1 Interrupt is higher priority than all in-service external interrupts 0 (ExtINT) TPR.mmi is 0, and interrupt is...
Page 368
The processor provides nested interrupt priority support for external interrupt vectors 0, 2, and 16 through 255 by: • Automatically masking external interrupts of equal or lower priority than the highest priority external interrupt currently in-service. This raises the in-service external interrupt masking level when each external interrupt begins service by an IVR read.
Page 369
    ssm PSR.i
    srlz.d
    // external interrupts may be sampled anywhere here
    rsm PSR.i
The stop following the srlz.d instruction in the above code sequence is required to force the Reset System Mask (rsm) instruction into a subsequent instruction group. The stop guarantees that the srlz.d will open the external interrupt window for at least one cycle before the rsm instruction closes it again.
Page 370
Table 5-9. External Interrupt Control Registers
Columns: Register; Name; Description
CR64: Local ID
CR65: External Interrupt Vector Register (read only)
CR66: Task Priority Register
CR67: End Of External Interrupt
CR68: IRR0, External Interrupt Request Register 0 (read only)
CR69: IRR1, External Interrupt Request Register 1 (read only)
CR70: IRR2, External Interrupt Request Register 2 (read only)...
Page 371
IVR is a read-only register; writes to IVR result in an Illegal Operation fault. IVR reads do not issue an external INTA cycle. If the interrupt vector must be acquired from an Intel 8259A-compatible external interrupt controller, software should perform a load from the INTA byte. See “Interrupt Acknowledge (INTA) Cycle”...
Page 372
PSR.up is set to 1, potentially enabling performance monitor interrupts, and the new priority levels need to be in place before this enabling, a data serialization must be performed. (Note that there is no dependence between writing TPR and subsequently changing any PSR bits other than these.) A data serialization operation must be performed after TPR is written and before IVR is read to ensure that the reported IVR vector is correctly masked.
Page 373
5.8.3.5 External Interrupt Request Registers (IRR0-3 – CR68,69,70,71)
Four 64-bit read-only External Interrupt Request Registers (IRR0-3, see Figure 5-10) provide the capability for software to determine the set of pending asynchronous external interrupts. IRR0 contains vectors <63:0> where vector 0 is in bit position 0, IRR1 contains vectors <127:64>, IRR2 contains vectors <191:128>, and IRR3 contains vectors <255:192>.
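The mapping from a vector number to an IRR register and bit position follows directly from the ranges above. The sketch below decodes pending vectors from four register values. The function names are hypothetical, and the register values are passed in as plain integers instead of being read from control registers.

```python
from typing import List, Sequence, Tuple


def irr_location(vector: int) -> Tuple[int, int]:
    """Return (IRR register index, bit position) for an external vector."""
    if not 0 <= vector <= 255:
        raise ValueError("vector numbers range from 0 to 255")
    return vector // 64, vector % 64


def pending_vectors(irr_values: Sequence[int]) -> List[int]:
    """List the vector numbers whose bits are set in IRR0..IRR3 (ascending)."""
    if len(irr_values) != 4:
        raise ValueError("expected values for IRR0 through IRR3")
    pending = []
    for reg_index, value in enumerate(irr_values):
        for bit in range(64):
            if (value >> bit) & 1:
                pending.append(reg_index * 64 + bit)
    return pending
```

For instance, vector 128 maps to bit 0 of IRR2, and vector 255 maps to bit 63 of IRR3.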
Page 374
5.8.3.7 Performance Monitoring Vector (PMV – CR73)
PMV specifies the external interrupt vector number for Performance Monitoring overflow interrupts. To ensure that subsequent performance monitor interrupts reflect the new state of PMV by a given point in program execution, software must perform a data serialization operation after a PMV write and prior to that point.
Page 375
INIT – pend an Initialization Interrupt for system firmware. The vector field is ignored.
reserved
ExtINT – pend an Intel 8259A-compatible interrupt. This interrupt is delivered at external interrupt vector number 0. For details on servicing ExtINT external interrupts see “Interrupt Acknowledge (INTA) Cycle”...
Page 376
Figure 5-15. Processor Interrupt Block Memory Layout (figure: offsets +0x000000 through +0x1FFFFF relative to ib_base, with the INTA byte at +0x1E0000 and undefined regions in the upper half)
The Inter-Processor Interrupt region occupies the lower half of the Processor Interrupt Block; by default its physical address range is 0x0000 0000 FEE0 0000 through 0x0000 0000 FEEF FFFF.
Page 377
INIT – pend an Initialization Interrupt for platform firmware on the processor listed in the destination. The vector field is ignored.
Reserved
ExtINT – pend an Intel 8259A-compatible interrupt. This interrupt is delivered at external interrupt vector number 0. For details on servicing ExtINT external interrupts see “Interrupt Acknowledge (INTA) Cycle”...
Page 378
The INTA Byte is located within the upper half of the Processor Interrupt Block, at offset 0x1E0000 from the base. A single byte load from the INTA address causes the processor to emit the INTA cycle on the processor system bus. An Intel 8259A-compatible external interrupt controller must respond with the actual interrupt vector number as the data to be loaded.
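The Processor Interrupt Block addresses quoted in this section can be worked out from the block base. The following sketch assumes the default base implied by the default IPI range (0x0000 0000 FEE0 0000) and a 2MB block, as suggested by the +0x1FFFFF upper offset in Figure 5-15. All names are hypothetical.

```python
DEFAULT_PIB_BASE = 0x0000_0000_FEE0_0000  # default base implied by the text
PIB_BYTES = 0x200000                      # assumed from the +0x1FFFFF top offset
IPI_REGION_BYTES = PIB_BYTES // 2         # IPI region is the lower half
INTA_OFFSET = 0x1E0000                    # INTA byte offset, from the text


def inta_address(pib_base: int = DEFAULT_PIB_BASE) -> int:
    """Physical address of the INTA byte for a given block base."""
    return pib_base + INTA_OFFSET


def in_ipi_region(address: int, pib_base: int = DEFAULT_PIB_BASE) -> bool:
    """True if `address` falls in the lower (IPI) half of the block."""
    return pib_base <= address < pib_base + IPI_REGION_BYTES
```

With the default base, the IPI region ends at 0xFEEF FFFF, matching the range given in the text, and the INTA byte falls in the upper half of the block.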
Page 379
processor does not interpret any data stored to the XTP Byte address and all data bits are passed to the external system unmodified. Any memory operation to the XTP address other than a single byte store is undefined. XTPR is written by operating system code to notify the system that the processor’s current task priority has been changed.
Page 381
Register Stack Engine
The register stack engine (RSE) moves registers between the register stack and the backing store in memory without explicit program intervention. The RSE operates concurrently with the processor and can take advantage of unused memory bandwidth to dynamically issue register spill and fill operations. In this manner, the latency of register spill/fill operations can be overlapped with useful program work.
Page 382
a stacked register from the backing store it also fills the register’s NaT bit. Whenever bits 8:3 of the RSE backing store load pointer are all ones, the RSE reloads a NaT collection from the backing store. Bit 63 of the NaT collection is ignored when read from the backing store.
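The "bits 8:3 are all ones" condition above identifies which 8-byte backing store slots hold NaT collections. A minimal sketch of that test, with a hypothetical function name:

```python
def is_nat_collection_slot(address: int) -> bool:
    """True if bits 8:3 of a backing store address are all ones.

    Per the text, this is the condition under which the RSE handles a NaT
    collection instead of a stacked register value.
    """
    return ((address >> 3) & 0x3F) == 0x3F
```

Because bits 8:3 form a 6-bit field, exactly one of every 64 consecutive 8-byte slots satisfies the condition, so each NaT collection follows 63 register slots.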
Page 383
The RSE operates concurrently and asynchronously with respect to instruction execution by taking advantage of unused memory bandwidth to dynamically perform register spill and fill operations. The algorithm employed by the RSE to determine whether and when to spill/fill is implementation dependent. Software cannot depend on the spill/fill algorithm.
Page 384
Table 6-1. RSE Internal State (Continued)
Columns: Name; Description; Corresponds To
RSE.ndirty: Number of dirty registers on the register stack
RSE.ndirty_words: Number of dirty words on the register stack plus corresponding number of NaT collection registers; corresponds to AR[BSP] - AR[BSPSTORE]
Register Stack Partitions
The processor’s physical register file provides at least 96 stacked registers.
Page 385
Figure 6-3. Four Partitions of the Register Stack (figure labels: Invalid, Clean, Dirty, and Current partitions of the physical stacked registers; RSE.LoadReg, RSE.StoreReg, RSE.BOF, CFM.sof; RSE.BspLoad, AR[BSPSTORE], AR[BSP] in the backing store, toward higher addresses; transitions labeled RSE Load, RSE Store, call, cover, alloc, return, rfi)
The boundaries between the four register stack partitions are defined by the current frame marker (CFM) and three physical register numbers: a load, store and bottom-of-frame register number.
Page 386
place at lower addresses, defined relative to BSP by the sizes of the clean and dirty partitions. Although the stack is conceptually infinite in both directions, the effective base of the stack is expected to be the first memory location of the first page allocated to the backing store.
Page 387
RSE Control
The RSE can be controlled at all privilege levels by means of three instructions (cover, flushrs, and loadrs) and by accessing four application registers (mov to/from RSC, BSP, BSPSTORE and RNAT). This section first presents each of the RSE application registers, and then discusses the three RSE control instructions.
Page 388
Protection is also checked based on the current entries in the data TLB. The RSE always remains coherent with respect to the data TLB. If a translation that is being used by the RSE is changed or purged, the RSE will immediately begin using the new translation or suffer a TLB miss.
Page 389
6.5.3 Backing Store Pointer Application Registers
The RSE defines two Backing Store Pointer application registers: BSPSTORE and BSP. Since the RSE backing store pointers are always 8-byte aligned, bits {2:0} of the backing store pointers always read as zero. When writing the BSPSTORE application register, bits {2:0} in the presented address are ignored.
Page 391
Table 6-5. RSE Control Instructions Instruction Affected State cover flushrs loadrs AR[BSP]{63:3} AR[BSP]{63:3}+ CFM.sof + Unchanged Unchanged (AR[BSP]{8:3} + CFM.sof)/63 AR[BSPSTORE]{63:3} Unchanged AR[BSP]{63:3} AR[BSP]{63:3} - AR[RSC].loadrs{13:3} RSE.BspLoad{63:3} Unchanged Model specific AR[BSP]{63:3} - AR[RSC].loadrs{13:3} AR[RNAT] Unchanged Updated UNDEFINED RSE.RNATBitIndex Unchanged AR[BSPSTORE]{8:3} AR[BSPSTORE]{8:3} CR[IFS] if (PSR.ic == 0) {...
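The AR[BSP] entry for cover in the flattened table above appears to read AR[BSP]{63:3} + CFM.sof + (AR[BSP]{8:3} + CFM.sof)/63. The sketch below evaluates that expression under this reading, taking the division to be integer division. The function name is hypothetical, and the reading of the garbled row is an assumption.

```python
def bsp_after_cover(bsp: int, sof: int) -> int:
    """Evaluate the cover update for AR[BSP] as read from Table 6-5.

    `bsp` is a byte address (bits 2:0 are dropped); `sof` is the size of the
    current frame in registers. The extra term counts the NaT collection
    slots crossed while advancing past `sof` register slots.
    """
    if sof < 0:
        raise ValueError("sof must be non-negative")
    slot = bsp >> 3                              # AR[BSP]{63:3}
    nat_slots = ((slot & 0x3F) + sof) // 63      # (AR[BSP]{8:3} + CFM.sof)/63
    return (slot + sof + nat_slots) << 3
```

For example, starting at slot 60 with a 5-register frame, registers occupy slots 60 through 62 and 64 through 65, with slot 63 skipped for the NaT collection, so the result points at slot 66.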
Page 392
• The CFM (after the return) is forced to zero; i.e., all CFM fields (including CFM.sof and CFM.sol) are set to zero.
• The registers from the returned-from frame and the preserved registers from the returned-to frame are added to the invalid partition of the register stack.
•...
Page 393
frame of the target instruction. When RSE.CFLE is set, instruction execution is stalled until the RSE has completely restored the current frame or an interruption occurs. This is the only time that the RSE issues any memory traffic for the current frame. Interruption delivery clears RSE.CFLE which allows an interruption handler to execute in the presence of an incomplete frame (e.g., to handle the fault raised by the mandatory RSE load).
Page 394
RSE Behavior on Interruptions
When the processor raises an interruption, the current register stack frame remains unchanged. If PSR.ic is one, the valid bit in the Interruption Function State register (IFS.v) is cleared. When the IFS.v bit is clear, the contents of the interruption frame marker field (IFS.ifm) are undefined.
Page 395
current frame again (either via another alloc instruction, or via a br.ret or rfi to a previous frame that contained that register), the value stored in the register, the NaT bit for the register, and the corresponding ALAT entry for the register remain undefined. RSE stores do not invalidate ALAT entries.
Page 396
3. Non-preemptive, synchronous backing store switch (covers system calls, user-level thread and operating system context switches)
Failure to follow these sequences may result in undefined RSE and processor behavior.
6.11.1 Switch from Interrupted Context
To switch from the backing store of an interrupted context to a new backing store:
1.
Page 397
1. Read and save the RSC, BSP and PFS application registers.
2. Issue a flushrs instruction to flush the dirty registers to the backing store.
3. Place RSE in enforced lazy mode by clearing both RSC.mode bits.
4. Read and save the RNAT application register.
5.
Page 399
Debugging and Performance Monitoring
Processors based on the Itanium architecture provide comprehensive debugging and performance monitoring facilities for both IA-32 and Itanium instructions. This chapter describes the debug registers, performance monitoring registers and their programming models. The debugging facilities include several data and instruction breakpoint registers, single step trap, breakpoint instruction fault, taken branch trap, lower privilege transfer trap, instruction and data debug faults.
Page 400
reference that matches the parameters specified by the IBR registers results in an IA_32_Exception(Debug) fault. If PSR.id is 1 or EFLAG.rf is 1, IA-32 Instruction Debug faults are disabled for one instruction. The successful execution of an IA-32 instruction clears the PSR.id and EFLAG.rf bits. •...
Page 401
Instruction/Data TLB Miss fault. If DBR.r and DBR.w are both 0, that data breakpoint register is disabled. Execute match enable – When IBR.x is 1, execution of an IA-32 instruction or Intel Itanium instruction in a bundle at an address matching the corresponding address register causes a breakpoint.
Page 402
Changes to debug registers and PSR are not necessarily observed by following instructions. Software should issue a data serialization operation to ensure modifications to DBR, PSR.db, PSR.tb and PSR.lp are observed before a dependent instruction is executed. For register changes to IBR and PSR.db that affect fetching of subsequent instructions, software must issue an instruction serialization operation.
Page 403
• The cmp8xchg16 operands are treated as 16-byte datums for both read and write breakpoint matching, even though this instruction only reads 8 bytes. Address breakpoint Data Debug faults are not reported for the Flush Cache (fc, fc.i), regular_form probe, non-faulting lfetch, insert TLB (itc, itr), purge TLB (ptc, ptr), or translation access (thash, ttag, tak, tpa) instructions.
Page 404
Processor implementations may not populate the entire PMC/PMD register space. Reading of an unimplemented PMC or PMD register returns zero. Writes to unimplemented PMC or PMD registers are ignored; i.e., the written value is discarded. Writes to PMD and PMC and reads from PMC are privileged operations. At non-zero privilege levels, these operations result in a Privileged Operation fault, regardless of the register address.
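The read-as-zero and write-ignored behavior for unimplemented registers can be modeled in a few lines. This is an illustrative sketch only: the class name and the set of implemented indices are hypothetical, and privilege checks are left out.

```python
from typing import Dict, Iterable


class PerfRegisterFile:
    """Toy model of a PMC or PMD register space with unimplemented entries."""

    def __init__(self, implemented: Iterable[int]) -> None:
        # Only implemented register indices get backing storage.
        self._values: Dict[int, int] = {index: 0 for index in implemented}

    def read(self, index: int) -> int:
        """Reads of unimplemented registers return zero."""
        return self._values.get(index, 0)

    def write(self, index: int, value: int) -> None:
        """Writes to unimplemented registers are discarded."""
        if index in self._values:
            self._values[index] = value
```

A write followed by a read of an unimplemented index therefore still returns zero, which is how software can observe that the written value was discarded.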
Page 405
A counter overflow interrupt occurs when the counter wraps; i.e., a carry out from bit W-1 is detected. Counter overflow interrupts are edge-triggered; that is, the event of a counter incrementing and causing carry out from bit W-1 thus setting the overflow bit and the freeze bit, generates one PMU interrupt.
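The wrap condition above, a carry out from bit W-1, can be shown with a small W-bit counter model. The function name and the `width` parameter (standing in for W) are hypothetical. The sketch reports only whether a carry occurred and does not model the overflow or freeze bits.

```python
from typing import Tuple


def increment_counter(value: int, amount: int, width: int) -> Tuple[int, bool]:
    """Add `amount` to a `width`-bit counter.

    Returns the wrapped counter value and whether a carry out of bit
    width-1 occurred, which is the overflow event described in the text.
    """
    if width <= 0 or amount < 0:
        raise ValueError("width must be positive and amount non-negative")
    mask = (1 << width) - 1
    total = (value & mask) + amount
    carried = (total >> width) != 0
    return total & mask, carried
```

Because the interrupt is edge-triggered, a model built on this function would raise one interrupt at the increment that produces the carry, not on every later increment.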
Page 406
Table 7-4. Generic Performance Counter Configuration Register Fields (PMC[4]..PMC[p]) (Continued) Field Bits Description Privileged monitor – When 0, the performance monitor is configured as a user monitor, and enabled by PSR.up. When PMC.pm is 1, the performance monitor is configured as a privileged monitor, enabled by PSR.pp, and the corresponding PMD can only be read by privileged software.
Page 407
Table 7-5. Reading Performance Monitor Data Registers (Continued) PSR.sp PMC[i].pm PSR.cpl PMD Reads Return >0 >0 >0 Generic PMD counter registers may be read by software without stopping the counters. Under normal counting conditions (PMC[0].fr is zero and has been serialized), the processor guarantees that a sequence of reads of a given PMD will return non-decreasing values corresponding to the program order of the reads.
Page 408
7.2.2 Performance Monitor Overflow Status Registers (PMC[0]..PMC[3])
Performance monitor interrupts may be caused by an overflow from a generic performance monitor or an implementation-dependent event from a model-specific monitor. The four performance monitor overflow registers (PMC[0]...PMC[3]) shown in Figure 7-6 indicate which monitor caused the interruption.
Page 409
If control register bit PMV.m is one, a performance monitoring interrupt is disabled from being pended. When PMV.m is zero, the interruption is received and held pending. (Further masking by the PSR.i, TPR and in-service masking can keep the interrupt from being raised.) Figure 7-6 shows the Performance Monitor Overflow Status registers.
Page 410
Multiple overflow bits may be set to 1, if counters overflow concurrently. The overflow bits and the freeze bit are sticky; i.e., the processor sets them to 1 but never resets them to 0. It is software's responsibility to reset the overflow and freeze bits. The overflow status bits are populated only for implemented counters.
Page 411
follow the implementation-independent overflow interrupt service routine outlined in Figure 7-7. Use of alternate context-switch sequences may be incompatible with future implementations. If the outgoing context has an interrupt pending but has not yet invoked the performance monitor interrupt service routine, the interrupt may be delivered to the incoming context even if it is a non-monitored process.
Page 412
When switching back to the original context (that originally caused the counter overflow), the previously saved freeze bit can be inspected. If it was set (meaning there was a pending performance monitor interrupt), then the context switch routine posts an interrupt message to the incoming context’s processor at the performance monitor vector specified by the PMV register (see Section 10.5.8, “Inter-processor Interrupts Layout and Example”...
Page 413
Interruption Vector Descriptions
Chapter 5 describes the interruption mechanism and programming model for the Itanium architecture. This chapter describes the IVA-based interruption handlers. “Interruption Vector Descriptions” describes all the Itanium IVA-based interruption vectors and “IA-32 Interruption Vector Definitions” describes all of the IA-32 interrupt vectors.
Page 414
Interruption Vector Definition Table 8-1.Writing of Interruption Resources by Vector IIP, IPSR, Interruption Resource ITIR IIB0, IIB1 IIPA, IFS.v PSR.ic at time of interruption Alternate Data TLB vector Alternate Data TLB fault IR Alternate Data TLB fault Alternate Instruction TLB vector Alternate Instruction TLB fault Break Instruction vector Break Instruction fault...
Page 415
Table 8-1.Writing of Interruption Resources by Vector (Continued) IIP, IPSR, Interruption Resource ITIR IIB0, IIB1 IIPA, IFS.v PSR.ic at time of interruption Reserved Register/Field fault Unimplemented Data Address fault IA-32 Exception vector IA-32 Intercept vector IA-32 Interrupt vector Instruction Access Rights vector Instruction Access Rights fault Instruction Access-Bit vector...
Page 416
Table 8-1.Writing of Interruption Resources by Vector (Continued) IIP, IPSR, Interruption Resource ITIR IIB0, IIB1 IIPA, IFS.v PSR.ic at time of interruption Unaligned Data Reference fault Unsupported Data Reference vector Unsupported Data Reference fault VHPT Translation vector IR VHPT Data fault VHPT Data fault VHPT Instruction fault Virtual External Interrupt vector...
Page 417
Table 8-2. ISR Values on Interruption (Continued) Vector / Interruption Instruction Debug fault IR Data Debug fault Dirty-Bit vector Data Dirty Bit fault Disabled FP-Register vector Disabled Floating-Point Register fault External Interrupt vector External Interrupt Floating-point Fault vector Floating-Point Exception fault Floating-point Trap vector Floating-Point Exception trap General Exception vector...
Page 418
Software must look at the ISR.code bit vector to determine if any lower priority trap occurred at the same time as the trap being processed.
Table 8-3. ISR.code Fields on Intel® Itanium® Traps
Field Description Floating-Point Exception trap...
Page 419
Table 8-3. ISR.code Fields on Intel® Itanium® Traps (Continued) Field Description Taken Branch trap Single Step trap Unimplemented Instruction Address trap fp trap code IEEE O (overflow) exception (Parallel FP-LO) fp trap code IEEE U (underflow) exception (Parallel FP-LO)
Page 421
Name: VHPT Translation vector (0x0000)
Cause: The hardware VHPT walker encountered a TLB miss while attempting to reference the virtually addressed hashed page table for a memory reference (including IA-32).
Interruptions on this vector: IR VHPT Data fault; VHPT Instruction fault; VHPT Data fault
Parameters: IIP, IPSR, IIPA, IFS –...
Page 422
(Bit-field layout figure, bits 63:0; field labels not recoverable.)
Notes: This fault can only occur when PSR.ic is 1 or in-flight, and the VHPT walker is enabled
Page 423
Name: Instruction TLB vector (0x0400)
Cause: The instruction TLB entry needed by an instruction fetch (including IA-32) is absent, and the hardware VHPT walker could not find the translation in the VHPT, or the hardware VHPT walker is enabled but not implemented on this processor.
Interruptions on this vector: Instruction TLB fault
Parameters...
Page 424
Name: Data TLB vector (0x0800)
Cause: For memory references (including IA-32), the data TLB entry needed by the data access is absent, and the hardware VHPT walker could not find the translation in the VHPT, or the hardware VHPT walker is not implemented on this processor.
Interruptions on this vector: IR Data TLB fault; Data TLB fault...
Page 425
Name: Alternate Instruction TLB vector (0x0c00)
Cause: The instruction TLB entry needed by an instruction fetch (including IA-32) is absent, and the hardware VHPT walker was not enabled for this address.
Interruptions on this vector: Alternate Instruction TLB fault
Parameters: IIP, IPSR, IIPA, IFS –...
Page 426
Name: Alternate Data TLB vector (0x1000)
Cause: For memory references (including IA-32), the data TLB entry needed by data access is absent, and the hardware VHPT walker was not enabled for this address.
Interruptions on this vector: IR Alternate Data TLB fault; Alternate Data TLB fault
Parameters: IIP, IPSR, IIPA, IFS –...
Page 427
Data Nested TLB vector (0x1400) Name Cause For memory references, the data TLB entry needed for a data reference is absent and PSR.ic is 0. Note: Data Nested TLB faults cannot occur during IA-32 instruction set execution, since PSR.ic must be 1. Interruptions on this vector: IR Data Nested TLB fault Data Nested TLB fault...
Page 428
Instruction Key Miss vector (0x1800) Name Cause For instruction fetches (including IA-32), the PSR.it bit is 1, the PSR.pk bit is 1, and the access key from the TLB entry for the address of the executing instruction bundle does not match any of the valid protection keys. Interruptions on this vector: Instruction Key Miss fault Parameters...
Page 429
Data Key Miss vector (0x1c00) Name Cause For memory references (including IA-32), the PSR.dt bit is 1, the PSR.pk bit is 1, and the access key from the TLB entry for the address referenced by a load, store, probe (regular_form probe or probe.fault) or semaphore operation does not match any of the valid protection keys.
Page 430
Dirty-Bit vector (0x2000) Name Cause IA-32 or Itanium store or semaphore operations to a page with the dirty-bit (TLB.d) equal to 0 in the data TLB. Interruptions on this vector: Data Dirty Bit fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description.
Page 431
Instruction Access-Bit vector (0x2400) Name Cause For instruction fetches (including IA-32), the access bit (TLB.a) in the TLB entry for this page is 0, and an instruction on the page is referenced. Interruptions on this vector: Instruction Access Bit fault Parameters IIP, IPSR, IIPA, IFS –...
Page 432
Data Access-Bit vector (0x2800) Name Cause For data memory references (including IA-32), the access bit (TLB.a) in the TLB entry for this page is 0, and the page is referenced. Interruptions on this vector: IR Data Access Bit fault Data Access Bit fault Parameters IIP, IPSR, IIPA, IFS –...
Page 433
Break Instruction vector (0x2c00) Name Cause An attempt is made to execute an Itanium break instruction. Interruptions on this vector: Break Instruction fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description. IIM – Is updated with the break instruction immediate value. IIB0, IIB1 –...
Page 434
External Interrupt vector (0x3000) Name Cause There are unmasked external interrupts pending from external devices, other processors, or internal processor events and: • PSR.i is 1, while executing Itanium instructions • PSR.i is 1 and (CFLAG.if is 0 or EFLAG.if is 1), while executing IA-32 instructions IPSR.is indicates which instruction set was executing at the time of the interruption.
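The unmasking condition above can be restated as a small predicate. This is a minimal sketch only: the function and parameter names are hypothetical, each flag is modeled as a plain boolean, and nothing here models interrupt priority or the pending-interrupt logic.

```python
def external_interrupt_unmasked(psr_i: bool, executing_ia32: bool,
                                cflag_if: bool = False,
                                eflag_if: bool = False) -> bool:
    """Return True when the stated PSR.i / CFLAG.if / EFLAG.if condition holds.

    Hypothetical helper: it mirrors only the two bullet conditions in the
    Cause text and assumes an unmasked interrupt is already pending.
    """
    if not psr_i:
        # PSR.i must be 1 for either instruction set.
        return False
    if not executing_ia32:
        # Itanium instructions: PSR.i alone decides.
        return True
    # IA-32 instructions: PSR.i is 1 and (CFLAG.if is 0 or EFLAG.if is 1).
    return (not cflag_if) or eflag_if
```

For example, with PSR.i set and IA-32 code running, the interrupt is held off only when CFLAG.if is 1 and EFLAG.if is 0.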
Page 435
Virtual External Interrupt vector (0x3400) Name Cause The guest highest pending interrupt (GHPI) specified by the VMM is unmasked on the virtual processor. IPSR.is indicates which instruction set was executing at the time of the interruption. Interruptions on this vector: Virtual External Interrupt Parameters IIP, IPSR, IIPA, IFS –...
Page 436
Page Not Present vector (0x5000) Name Cause The bundle or IA-32 instruction being executed resides on a page for which the P-bit (TLB.p) in the instruction TLB entry is 0, or the data being referenced resides on a page for which the P-bit in the data TLB entry is 0. Interruptions on this vector: IR Data Page Not Present fault Instruction Page Not Present fault...
Page 437
Key Permission vector (0x5100) Name Cause Data access (including IA-32): The PSR.dt bit is 1, the PSR.pk bit is 1 and read or write permission is disabled by the matching protection register on a load, store, or semaphore operation. The RSE may cause this fault if PSR.rt is 1, the PSR.pk bit is 1 and read or write permission is disabled by the matching protection register on an RSE mandatory load/store operation.
Page 438
Instruction Access Rights vector (0x5200) Name Cause For instruction fetches (including IA-32), the PSR.it bit is 1, and the access rights for this page do not allow execution or do not allow execution at the current privilege level. Interruptions on this vector: Instruction Access Rights fault Parameters IIP, IPSR, IIPA, IFS –...
Page 439
Data Access Rights vector (0x5300) Name Cause For memory references (including IA-32), the PSR.dt bit is 1, and the access rights for this page do not allow read access or do not allow read access at the current privilege level for load and semaphore operations. The PSR.dt bit is 1, and the access rights for this page do not allow write access or do not allow write access at the current privilege level for store and semaphore operations.
Page 440
General Exception vector (0x5400) Name Cause An attempt is being made to execute an illegal operation, privileged instruction, access a privileged register, unimplemented field, unimplemented register, unimplemented address, or take an inter-instruction set branch when disabled. Interruptions on this vector: IR Unimplemented Data Address fault Illegal Operation fault Illegal Dependency fault...
Page 441
• If the instruction has two PR targets, and specifies the same PR for both, predicated-off unconditional compare, fclass, tbit, tnat, and tf instructions take this fault, even when their qualifying predicate is zero. • Register bank conflict on a floating-point load pair instruction. •...
Page 442
• ISR.code{7:4} = 4: Disabled Instruction Set Transition fault. An instruction set transition was attempted while PSR.di was 1. This fault can be raised by either the Itanium br.ia instruction or the IA-32 jmpe instruction. IPSR.is indicates the faulting instruction set. •...
Page 443
Disabled FP-Register vector (0x5500) Name Cause An attempt is made to reference a floating-point register set that is disabled. When PSR.dfl is 1, execution of any IA-32 FP, SSE or MMX technology instructions raises a Disabled FP Register Low Fault (regardless of whether FR2 - FR31 are actually referenced).
Page 444
NaT Consumption vector (0x5600) Name Cause A non-speculative operation (including IA-32) (e.g., load, store, control register access, instruction fetch etc.) read a NaT source register, NaTVal source register, or referenced a NaTPage. Interruptions on this vector: IR Data NaT Page Consumption fault Instruction NaT Page Consumption fault Register NaT Consumption fault Data NaT Page Consumption fault...
Page 445
behavior of NaT and NaTVal values is model specific, see Section 6.2.4.3, “NaT/NaTVal Response for IA-32 Instructions” on page 1:134 for details. • ISR – The value for the ISR bits depend on the type of access performed and are specified below.
Page 446
Speculation vector (0x5700) Name Cause A chk.a, chk.s, or fchkf instruction needs to branch to recovery code, and the branching behavior is unimplemented by the processor. This fault cannot be raised by IA-32 instructions. Interruptions on this vector: Speculative Operation fault Parameters IIP, IPSR, IIPA, IFS –...
Page 447
The Speculative Operation fault handler does not need to check for unimplemented instruction addresses. They will be checked automatically by processor hardware when the handler executes its rfi. On processors which report unimplemented instruction addresses with an Unimplemented Instruction Address (UIA) trap, if an emulated check instruction targets an unimplemented address and also needs to take a Single Step trap or Taken Branch trap (or both), the UIA trap will not be raised until after the Single Step and/or Taken Branch trap has been handled, making it appear that the Unimplemented...
Page 448
Debug vector (0x5900) Name Cause A debug fault has occurred. Either the instruction address matches the parameters set up in the instruction debug registers, or the data address of a load, store, semaphore, or mandatory RSE fill or spill matches the parameters set up in the data debug registers.
Page 449
Unaligned Reference vector (0x5a00) Name Cause If PSR.ac is 1, and the data address being referenced by an Itanium instruction is not aligned to the natural size of the load, store, or semaphore operation, or a data reference is made to a misaligned datum not supported by the implementation. “Memory Access Instructions”...
Page 450
Unsupported Data Reference vector (0x5b00) Name Cause An attempt was made to: • Execute a fetchadd, cmpxchg, xchg, or unsupported ld16, st16 or 10-byte memory reference (ldfe or stfe) instruction to a page that is neither cacheable with write-back write policy nor a NaTPage. •...
Page 451
Floating-point Fault vector (0x5c00) Name Cause A floating-point exception fault has occurred. IA-32 numeric instructions cannot raise this fault; IA-32 floating-point faults are delivered on the IA_32_Exception(Floating-Point) vector. Interruptions on this vector: Floating-Point Exception fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description....
Page 452
Floating-point Trap vector (0x5d00) Name Cause A floating-point exception trap has occurred. IA-32 numeric instructions can not raise this trap. Interruptions on this vector: Floating-Point Exception trap Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description. IIB0, IIB1 –...
Page 453
Lower-Privilege Transfer Trap vector (0x5e00) Name Cause Two trapping conditions transfer control to this vector: • An attempt is made to transfer control to an unimplemented address, resulting in either an Unimplemented Instruction Address trap or an Unimplemented Instruction Address fault. See “Unimplemented Address Bits”...
Page 454
(ISR bit-layout figure; fields shown: ss, tb, ni, ir)
Notes The Unimplemented Instruction Address trap can be the result of a taken branch, a...
Page 455
Taken Branch Trap vector (0x5f00) Name Cause A taken branch was executed, and the PSR.tb bit is 1. IA-32 instructions cannot raise this trap; IA-32 taken branch traps are delivered on the IA_32_Exception(Debug) vector. The Taken Branch trap is not taken on an rfi instruction. Interruptions on this vector: Taken Branch trap Parameters...
Page 456
Single Step Trap vector (0x6000) Name Cause An instruction was successfully executed, and the PSR.ss bit is 1. IA-32 instructions cannot raise this trap; IA-32 single step events are delivered on the IA_32_Exception(Debug) vector (see Chapter 9, “IA-32 Interruption Vector Descriptions”)....
Page 457
Virtualization vector (0x6100) Name Cause An attempt is made to execute an instruction which requires virtualization. This fault cannot be raised by IA-32 instructions. Interruptions on this vector: Virtualization fault Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description.
Page 458
IA-32 Exception vector (0x6900) Name Cause A fault or trap was raised while executing from the IA-32 instruction set. Interruptions on this vector: IA-32 Instruction Debug fault IA-32 Code Fetch fault IA-32 Instruction Length > 15 bytes fault IA-32 Device Not Available fault IA-32 FP Error fault IA-32 Segment Not Present fault IA-32 Stack Exception fault...
Page 459
IA-32 Intercept vector (0x6a00) Name Cause An intercept fault or trap was raised while executing from the IA-32 instruction set. This vector handles all the IA-32 intercepts described in Chapter 9, “IA-32 Interruption Vector Descriptions.” Interruptions on this vector: IA-32 Invalid Opcode fault IA-32 Instruction Intercept fault IA-32 Locked Data Reference fault IA-32 System Flag Intercept trap...
Page 460
IA-32 Interrupt vector (0x6b00) Name Cause An IA-32 software interrupt trap was executed. This vector handles all the IA-32 software interrupts described in Chapter 9, “IA-32 Interruption Vector Descriptions.” Interruptions on this vector: IA-32 Software Interrupt (INT) trap Parameters IIP, IPSR, IIPA, IFS – are defined; refer to page 2:165 for a detailed description.
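The vector offsets listed on the preceding pages can be collected into a lookup table. The sketch below is illustrative: the offsets are copied from the vector headings in this chapter, while the helper name handler_address and the idea of adding the offset to a base held in the IVA register are assumptions stated here rather than taken from these pages.

```python
# Offsets as given in the vector headings of this chapter.
VECTOR_OFFSETS = {
    "VHPT Translation": 0x0000,
    "Instruction TLB": 0x0400,
    "Data TLB": 0x0800,
    "Alternate Instruction TLB": 0x0C00,
    "Alternate Data TLB": 0x1000,
    "Data Nested TLB": 0x1400,
    "Instruction Key Miss": 0x1800,
    "Data Key Miss": 0x1C00,
    "Dirty-Bit": 0x2000,
    "Instruction Access-Bit": 0x2400,
    "Data Access-Bit": 0x2800,
    "Break Instruction": 0x2C00,
    "External Interrupt": 0x3000,
    "Virtual External Interrupt": 0x3400,
    "Page Not Present": 0x5000,
    "Key Permission": 0x5100,
    "Instruction Access Rights": 0x5200,
    "Data Access Rights": 0x5300,
    "General Exception": 0x5400,
    "Disabled FP-Register": 0x5500,
    "NaT Consumption": 0x5600,
    "Speculation": 0x5700,
    "Debug": 0x5900,
    "Unaligned Reference": 0x5A00,
    "Unsupported Data Reference": 0x5B00,
    "Floating-point Fault": 0x5C00,
    "Floating-point Trap": 0x5D00,
    "Lower-Privilege Transfer Trap": 0x5E00,
    "Taken Branch Trap": 0x5F00,
    "Single Step Trap": 0x6000,
    "Virtualization": 0x6100,
    "IA-32 Exception": 0x6900,
    "IA-32 Intercept": 0x6A00,
    "IA-32 Interrupt": 0x6B00,
}


def handler_address(iva_base: int, vector_name: str) -> int:
    """Hypothetical helper: base of the interruption vector table plus offset.

    Assumes the handler entry point is iva_base + offset; check the IVA
    register definition before relying on this.
    """
    return iva_base + VECTOR_OFFSETS[vector_name]
```

A table like this makes the layout easy to see: the listed vectors up to 0x3400 are spaced 0x400 apart, and those from 0x5000 onward are spaced 0x100 apart.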
Page 461
EFLAG.tf is 1. b0 to b3 Data breakpoint trap due to a match with the corresponding Intel Itanium data breakpoint registers. Each bit indicates a match with the corresponding DBR registers; b0=DBR0/1, b1=DBR2/3, b2=DBR4/5, b3=DBR6/7. Zero, one or more bits may be set.
Page 462
IA_32_Exception (Divide) – Divide Fault Name Cause IA-32 IDIV or DIV instruction attempted a divide by zero operation. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault. Parameters IIP – virtual IA-32 instruction address zero extended to 64-bits....
Page 463
The Itanium architecture debug facilities triggered an IA-32 code breakpoint fault on an IA-32 instruction fetch, and PSR.id and EFLAG.rf are 0. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault....
Page 464
In the Itanium System Environment, IA-32 Mov SS or Pop SS single step and data breakpoint traps are NOT deferred to the next instruction. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this trap....
Page 465
IA_32_Exception (Break) – INT 3 Trap Name Cause IA-32 breakpoint instruction (INT 3) triggered a trap. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this trap. Parameters IIPA – trapping virtual IA-32 instruction address zero extended to 64-bits....
Page 466
IA_32_Exception (Overflow) – Overflow Trap Name Cause IA-32 INTO instruction execution when EFLAG.of is set to one. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this trap. Parameters IIPA – trapping virtual IA-32 instruction address zero extended to 64-bits....
Page 467
IA_32_Exception (Bound) – Bounds Fault Name Cause Failed IA-32 Bound check instruction. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault. Parameters IIP – virtual IA-32 instruction address zero extended to 64-bits....
Page 468
IA_32_Exception (InvalidOpcode) – Invalid Opcode Fault Name Cause All IA-32 invalid opcode faults are delivered to the IA_32_Intercept(Instruction) handler, including IA-32 illegal, unimplemented opcodes, MMX technology and SSE instructions if CR0.EM is 1, and SSE instructions if CR4.fxsr is 0. All illegal IA-32 floating-point opcodes result in an IA_32_Intercept(Instruction) regardless of the state of CR0.em.
Page 469
The processor executed an IA-32 ESC or floating-point instruction while CR0.em is 1, or an IA-32 WAIT, ESC, floating-point, MMX technology or SSE instruction was executed while the CR0.ts bit is 1. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault. Parameters IIP –...
Page 470
Double Fault Name Cause IA-32 Double Faults (IA-32 vector 8) are not generated by the processor in the Itanium System Environment. 2:222 Volume 2, Part 1: IA-32 Interruption Vector Descriptions...
Page 471
Invalid TSS Fault Name Cause IA-32 Invalid TSS Faults (IA-32 vector 10) are not generated in the Itanium System Environment.
Page 472
IIPA – virtual address of the faulting IA-32 instruction zero extended to 64-bits. ISR.vector – 11. ISR.code – IA-32 defined error code. See Intel® 64 and IA-32 Architectures Software Developer’s Manual. (ISR bit-layout figure)...
Page 473
IA-32 defined set of stack segment fault conditions detected during stack segment load operations or memory references relative to the stack segment; refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete list of all IA-32 faulting conditions. Stack faults can also be generated when the processor detects an inconsistent stack segment register descriptor value during an IA-32 stack reference instruction (e.g....
Page 474
IA-32 defined set of data and code segment fault conditions detected during data or code segment load operations or memory references relative to code or data segments; refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete list of all IA-32 General Protection Fault conditions. General Protection faults...
Page 475
Page Fault Name Cause IA-32 defined page faults (IA-32 vector 14) cannot be generated in the Itanium System Environment.
Page 476
Itanium System Environment. IA-32 numeric exception delivery is not triggered by Itanium numeric exceptions or the execution of Itanium numeric instructions. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault....
Page 477
An IA-32 instruction performed an unaligned data memory reference while PSR.ac is 1, or EFLAG.ac is 1 and CR0.am is 1 and the effective privilege level is 3. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault....
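The alignment-check condition can be sketched as a predicate. This is a simplified illustration under stated assumptions: "unaligned" is taken to mean the address is not a multiple of a power-of-two access size, the condition is read as PSR.ac OR (EFLAG.ac AND CR0.am AND privilege level 3), and both function names are hypothetical.

```python
def is_naturally_aligned(address: int, size: int) -> bool:
    """True when address is a multiple of size (size assumed a power of two)."""
    return address % size == 0


def ia32_alignment_check_applies(address: int, size: int, psr_ac: bool,
                                 eflag_ac: bool, cr0_am: bool,
                                 privilege_level: int) -> bool:
    """Hypothetical predicate for the Cause text above."""
    if is_naturally_aligned(address, size):
        return False
    # Either the Itanium-side enable, or the full IA-32 enable chain.
    return psr_ac or (eflag_ac and cr0_am and privilege_level == 3)
```

Note that PSR.ac enables the check at any privilege level in this reading, whereas the EFLAG.ac path applies only at privilege level 3.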
Page 478
Machine Check Name Cause IA-32 Machine Check (IA-32 vector 18) is not generated in the Itanium System Environment.
Page 479
SSE instruction. SSE instructions do NOT trigger the report of any pending IA-32 floating-point exceptions. SSE instructions always ignore CR0.ne and the IGNNE pin. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for a complete definition of this fault....
Page 480
IA_32_Interrupt (Vector #N) – Software Trap Name Cause The IA-32 INT n instruction forces an IA-32 interrupt trap. The IA-32 IDT is not consulted nor are any values pushed onto a memory stack. Parameters IIPA – trapping virtual IA-32 instruction address (points to the INT instruction) zero extended to 64-bits.
Page 481
INT1, SIDT, SGDT, SLDT, SMSW, WBINVD, WRMSR, and all other unimplemented and illegal opcode patterns. If CR0.em is 1, execution of all IA-32 Intel MMX technology and IA-32 SSE instructions results in this intercept. If CR4.FXSR is 0, execution of all IA-32 SSE instructions results in this intercept.
Page 482
Figure 9-3. IA-32 Intercept Code (fields shown: sp, np, rp, lp, as, os)
Table 9-1. Intercept Code Definition
Field Bits Description
Operand Size – (OperandSize Prefix XOR CSD.d bit). When 1, indicates the effective operand size is 32-bits, when 0, 16-bits.
Address Size –...
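The Operand Size rule in Table 9-1 is a one-bit XOR. The sketch below illustrates only that row; the function names are hypothetical and the position of the bit within the intercept code is not modeled.

```python
def operand_size_bit(operand_size_prefix: bool, csd_d: bool) -> int:
    """Operand Size field: OperandSize Prefix XOR CSD.d (1 or 0)."""
    return int(operand_size_prefix != csd_d)


def effective_operand_size(operand_size_prefix: bool, csd_d: bool) -> int:
    """Return 32 when the Operand Size field is 1, otherwise 16."""
    return 32 if operand_size_bit(operand_size_prefix, csd_d) else 16
```

So a 32-bit default code segment (CSD.d set) with an operand-size prefix yields a 16-bit effective operand size, and a 16-bit segment with the prefix yields 32 bits.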
Page 483
IA_32_Intercept (Gate) – Gate Intercept Trap Name Cause If an IA-32 control transfer is initiated through a GDT/LDT descriptor that transfers control through a Call Gate, Task Gate or Task Segment this interception trap is generated. Parameters IIPA – trapping virtual IA-32 instruction address zero extended to 64-bits. IIP –...
Page 484
IA_32_Intercept (SystemFlag) – System Flag Trap Name Parameters System Flag Intercept Traps are generated for the following conditions: CLI, STI, POPF, POPFD instructions. If the EFLAG.if bit changes state and CFLG.ii is 1, or EFLAG.tf or EFLAG.ac change state, a System Flag intercept notification trap is delivered after the instruction completes.
Page 485
IA_32_Intercept (Lock) – Locked Data Reference Fault Name Cause For IA-32 locked operations, if the DCR.lc bit is 1, and an atomic operation is made to non-write-back memory or to unaligned write-back memory that would result in a read-modify-write sequence being performed externally under an external bus lock, the processor raises a Locked Data Reference fault.
Page 487
Itanium® Architecture-based Operating System Interaction Model with IA-32 Applications
This section describes the IA-32 system execution model from the perspective of an Itanium architecture-based operating system interfacing with IA-32 code, while operating in the Itanium System Environment. The main features covered are: •...
Page 488
Control Registers unmodified, Controls instruction set execution (including IA-32) shared IFA, IIP, Intel Itanium interruption registers may be overwritten on IPSR, ISR, any TLB fault, interruption or exception encountered IIM, IIPA, during IA-32 or Intel Itanium instruction set execution. shared...
Page 489
When Itanium architecture-based software loads these registers, no data integrity checks are performed at that time if illegal values are loaded in any fields. For a complete definition of all bit fields and field semantics refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual....
Page 490
The TSSD descriptor points to the I/O Permission Bitmap. If CFLG.io is 1, IN, INS, OUT, and OUTS consult the TSSD I/O permission bitmap as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual. If CFLG.io is 0, the TSSD I/O permission bitmap is not checked....
Page 491
10.3.1 IA-32 Current Privilege Level PSR.cpl is the current privilege level of the processor for instruction execution (including IA-32). PSR.cpl is used by the processor for all IA-32 descriptor segmentation and paging permission checks. PSR.cpl is a secured register. Typical IA-32 processors used SSD.dpl as the official privilege level of the processor.
Page 492
If CFLG.ii is 1, successful modification of the IF-bit by CLI, STI, or POPF results in an IA_32_Intercept(SystemFlag) trap, otherwise the IF-bit is modified without interception. Modification of this bit by Intel Itanium instructions does not result in an...
Page 493
13:12 IA-32 In/Out Privilege Level, controls accessibility by IA-32 IN/OUT instructions to the I/O port space and permission to modify the IF-bit for Intel Itanium and IA-32 instructions. If PSR.cpl > IOPL, permission is denied for IA-32 IN/OUT instructions, and modifications of EFLAG.if by either IA-32 or Intel Itanium instructions are ignored.
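The IOPL comparison can be illustrated with a short sketch. It covers only the privilege comparison described above: the helper names are hypothetical, EFLAG is modeled as a plain integer with IOPL in bits 13:12, and any further checks (such as the TSS I/O permission bitmap) are outside its scope.

```python
IOPL_SHIFT = 12   # IOPL occupies EFLAG bits 13:12
IOPL_MASK = 0x3


def eflag_iopl(eflag: int) -> int:
    """Extract the 2-bit I/O privilege level from an EFLAG value."""
    return (eflag >> IOPL_SHIFT) & IOPL_MASK


def iopl_denies(psr_cpl: int, eflag: int) -> bool:
    """True when PSR.cpl > IOPL, i.e. the IOPL comparison denies permission."""
    return psr_cpl > eflag_iopl(eflag)
```

For instance, code running at privilege level 3 passes the comparison only when both IOPL bits are set.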
Page 494
64 and IA-32 Architectures Software Developer’s Manual for details. Affects execution of POPF, PUSHF, CLI and STI. This bit is supported in both the IA-32 and Intel Itanium System Environments. An IA-32 Code Fetch fault (GPFault(0)) is generated on every IA-32 instruction (including the target of rfi and br.ia), if the following condition is true:...
Page 495
CFLG.mp is 1, execution of IA-32 FWAIT/WAIT instructions results in an IA_32_Exception(DNA) fault. This bit is ignored by Intel Itanium instructions. This bit is supported in both the IA-32 and Intel Itanium System Environments. See Intel® 64 and IA-32 Architectures Software Developer’s...
Page 496
CR0.NE CFLG.ne Numeric Error: Numeric errors are always enabled in the Intel Itanium System Environment. The NE bit and the IGNNE# pin are ignored by the processor and the FERR# pin is not asserted for any numeric errors on IA-32 or Intel Itanium floating-point instructions.
Page 497
Itanium architecture-based code does NOT have any side effects such as flushing the TLBs. This bit is supported as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual for the IA-32 System Environment....
Page 498
IA-32 Architectures Software Developer’s Manual for the IA-32 System Environment. CR4.PGE CFLG.pge Paging Global Enable: This bit is ignored in the Intel Itanium System Environment. This bit is provided as storage for compatibility purposes. This bit is supported as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual for...
Page 499
CR4.pce is 1. Otherwise execution of the RDPMC instruction results in a GPFault. CFLG.pce is ignored by Intel Itanium instructions. This bit is supported in both the IA-32 and Intel Itanium System Environments. See the Intel® 64 and IA-32 Architectures Software Developer’s Manual for details on these bits....
Page 500
10.3.3.3 IA-32 Memory Type Range Registers (MTRRs)
Within the Itanium System Environment, IA-32 MTRR registers are superseded by physical memory attributes supplied by the TLB, as defined in Section 4.4.3, “Cacheability and Coherency Attribute” on page 2:77. IA-32 instruction references to the MTRRs in the MSR register space result in an instruction intercept fault....
Page 501
Table 10-5 summarizes IA-32 instruction behavior within the Itanium System Environment. All IA-32 instructions are unchanged from the Intel® 64 and IA-32 Architectures Software Developer’s Manual except where noted. IA-32 instructions can also generate additional Itanium register and memory faults as defined in...
Page 502
Table 5-6. Please refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual for the behavior of all IA-32 instructions in the IA-32 System Environment. For all listed and unlisted IA-32 instructions in Table 10-5 the following relationships hold: •...
Page 503
Table 10-5. IA-32 Instruction Summary (Continued)
IA-32 Instruction | Intel® Itanium® System Environment | Comments
CMPXCHG, 8B | Optional Lock Intercept | If Locks are disabled (DCR.lc is 1) and a processor external lock transaction is required
CPUID
CWD, CDQ
CVTPI2PS, CVTPS2PI,...
Page 505
IMUL IN, INS unchanged + I/O ports are If CFLG.io is 0, the TSS I/O permission bitmap is mapped virtually not consulted. Intel Itanium TLB faults control accessibility to I/O ports. unchanged INT 3, INTO Mandatory Exception vector Delivered as an IA_32_Interrupt...
Page 506
ORPS OUT, OUTS unchanged + I/O ports are If CFLG.io is 0, the TSS I/O permission bitmap is mapped virtually not consulted. Intel Itanium TLB faults control accessibility to I/O ports. PACKSS, PACKUS PADD, PADDS, PADDUS PAND, PANDN PCMPEQ, PCMPGT...
Page 507
Table 10-5. IA-32 Instruction Summary (Continued)
IA-32 Instruction | Intel® Itanium® System Environment | Comments
near: no change far: no change less privilege: no change same privilege: no change + additional taken branch trap If PSR.tb is 1, raise a taken branch trap....
Page 508
10.6.1 Virtual Memory References
In the Itanium System Environment, the following virtual memory options are available for supporting IA-32 and Itanium memory references.
• Software TLB fills (TLBs are enabled, but the VHPT is disabled)....
Page 509
10.6.2 IA-32 Virtual Memory References By definition, IA-32 instruction and data memory references are confined to 32-bits of virtual addressing, the first 4 G-bytes of virtual region 0. However, IA-32 memory references can be mapped anywhere within the implemented physical address space by operating system code.
Page 510
Figure 10-5. Physical Memory Addressing (IA-32: the 16/32-bit effective address, after segmentation, forms PA{31:0} with PA{63:32}=0; Intel® Itanium®: the 64-bit address forms PA{63:0})
Page 511
10.6.6 Supervisor Accesses If the processor is operating in the Itanium System Environment, supervisor override is disabled, and LDT, GDT, TSS references are performed at the privilege level specified by PSR.cpl. Unaligned processor references to LDT, GDT, and TSS segments will never generate an EFLAG.ac enabled IA-32 Exception (AlignmentCheck) fault, even if PSR.cpl equals 3 and supervisor override is disabled.
Page 512
10.6.8 Atomic Operations
All Itanium load/stores and IA-32 non-locked memory references up to 64-bits that are aligned to their natural data boundaries are atomic. Both IA-32 and Itanium atomic semaphore operations can be performed on the same shared memory location. The processor ensures IA-32 locked read-modify-write opcodes and Itanium semaphore operations are performed atomically even if the operations are initiated from the other instruction set by the same processor, or between multiple processors in a multiprocessing system....
Page 513
• All IA-32 read-modify-write or locked instructions have memory fence semantics. All buffered stores are flushed.
• IA-32 IN, OUT and serializing operations (as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual) have memory fence semantics....
Page 515
Itanium loads and stores by issuing an acquire operation (or mf) before the instruction set transition.
10.6.10.1.2 Transitions from IA-32 Instruction Set to Intel® Itanium® Instruction Set
• All data dependencies are honored; Itanium loads see the results of all prior Itanium and IA-32 stores....
Page 516
Figure 10-1. I/O Port Space Model (virtual and physical address spaces, each containing memory-mapped I/O reached by IA-32/Intel® Itanium® loads/stores and a 64MB I/O port block reached by IA-32 IN, OUT; the virtual block begins at IOBase)
In the Itanium System Environment, the virtual location of the 64 MB I/O port space is determined by the operating system....
Page 517
(Figure: Port{15:2} of the IA-32 I/O port number is shifted left 12 bits and combined with Port{11:0} to form the I/O port address used by Intel® Itanium® loads and stores.)
For IA-32 IN and OUT instructions a port’s virtual address is computed as:
port_virtual_address = IOBase | (port{15:2}<<12) | port{11:0}
This address computation places 4 ports on each 4K page and expands the space to 64MB, with the ports being at a relative offset specified by port{11:0} within each 4K-byte virtual page....
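The address computation above can be worked through in a few lines. This is a minimal sketch: the function name is hypothetical, IOBase is passed in as a plain integer, and the requirement that it be 64 MB aligned is an assumption made here so that the OR operation behaves like an addition.

```python
IO_SPACE_SIZE = 64 * 1024 * 1024  # expanded I/O port space, 64 MB


def port_virtual_address(io_base: int, port: int) -> int:
    """Compute IOBase | (port{15:2} << 12) | port{11:0}."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("IA-32 port numbers are 16 bits wide")
    if io_base % IO_SPACE_SIZE != 0:
        # Assumption of this sketch, not a statement from the text above.
        raise ValueError("this sketch assumes a 64 MB-aligned IOBase")
    page_index = port >> 2      # port{15:2}: selects the 4K page
    page_offset = port & 0xFFF  # port{11:0}: offset within that page
    return io_base | (page_index << 12) | page_offset
```

Ports 0 through 3 land on the first 4K page at offsets 0 through 3, port 4 lands on the second page at offset 4, and port 0x3F8 maps to offset 0xFE3F8 from IOBase.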
Page 518
Operating System Warning: Operating system code can not remap a given port to another port address within the I/O port space, such that port_physical_address{21:12} != port_physical_address{11:2}. Otherwise, based on the processor model, I/O port data may be placed on the wrong bytes of the processor’s bus and the port will not be correctly accessed.
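The constraint in this warning can be checked mechanically. The sketch below is illustrative only, with a hypothetical function name: it compares the two 10-bit fields named in the warning and reports whether a physical port address keeps them equal.

```python
def port_mapping_is_consistent(port_physical_address: int) -> bool:
    """True when address bits {21:12} equal address bits {11:2}."""
    bits_21_12 = (port_physical_address >> 12) & 0x3FF
    bits_11_2 = (port_physical_address >> 2) & 0x3FF
    return bits_21_12 == bits_11_2
```

An offset such as 0xFE3F8 satisfies the check, since both fields equal 0xFE, while 0xFF3F8 does not.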
Page 519
10.7.3 IA-32 IN/OUT instructions
IA-32 I/O instructions (IN, OUT, INS, OUTS) defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual are augmented as follows:
• I/O instructions first check for IOPL permission. If PSR.cpl<=EFLAG.iopl, access permission is granted....
Page 520
• If data translations are disabled (PSR.dt is 0) or the referenced I/O port is mapped to an unimplemented virtual address (via the IOBase register), a GPFault is raised on the referencing IA-32 IN, OUT, INS, or OUTS instruction. • Alignment and Data Address breakpoints are also checked and may result in an IA_32_Exception(AlignmentCheck) fault (if PSR.ac is 1) or IA_32_Exception(Debug) trap.
Page 521
[mf] //Fence prior memory references, if required
add port_addr = IO_Port_Base, Expanded_Port_Number
ld.acq data, (port_addr)
[mf.a] //Wait for platform acceptance, if required
[mf] //Fence future memory references, if required
10.8 Debug Model
The debug facilities defined by the Itanium architecture are designed to support debugging of both the Itanium and IA-32 instruction set....
Page 522
10.8.1 Data Breakpoint Register Matching
Each Itanium data breakpoint register has the following matching behavior for IA-32 instruction set data memory references:
• DBR.addr – IA-32 single or multi-byte data memory references that access ANY memory byte specified by the DBR address and mask fields result in a debug breakpoint trap regardless of datum size and alignment....
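The "ANY byte" matching rule can be sketched as a byte-by-byte comparison. This is an illustration under an explicit assumption that is not stated in the text above: a mask bit of 1 means the corresponding address bit takes part in the comparison. The function name and argument layout are hypothetical.

```python
def dbr_matches(dbr_addr: int, dbr_mask: int,
                access_addr: int, access_size: int) -> bool:
    """True if any byte touched by the access falls in the DBR range.

    Assumes mask bits set to 1 select the address bits that are compared.
    """
    for byte_addr in range(access_addr, access_addr + access_size):
        if (byte_addr & dbr_mask) == (dbr_addr & dbr_mask):
            return True
    return False
```

With an 8-byte breakpoint range at 0x1000, a 4-byte access starting at 0x0FFE matches because its last two bytes fall inside the range, even though the access is unaligned and starts outside it.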
Page 523
3) record the state of IA-32 execution at the point of interruption. For IA-32 exceptions, ISR contains IA-32 defined error codes and vector numbers as defined by the Intel® 64 and IA-32 Architectures Software Developer’s Manual. IA-32 instruction set related exceptions and software...
Page 524
IA_32_Exception (Debug) TrapCode IA-32 debug events. The Trap Code indicates concurrent taken branch, data breakpoint and single step trap conditions. External Interrupt NMI is delivered through the Intel Itanium External Interrupt vector. IA_32_Exception(Break) TrapCode IA-32 INT 3 instruction. IA_32_Exception(INTO) TrapCode IA-32 INTO detected overflow trap.
Page 525
IA-32 numeric instructions follow the IA-32 delayed floating-point exception model. Specifically, IA-32 numeric exceptions are held pending until the next IA-32 numeric or MMX technology instruction as defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual. Numeric faults generated on SSE instructions are reported precisely on the faulting SSE instruction....
Page 526
transactions. For IA-32 code, if the platform does not support LOCK or SPLCK, the operating system must disable external bus lock transactions by setting DCR.lc to 1. When DCR.lc is 1, any IA-32 atomic reference not serviced internally in the processor’s caches results in an IA_32_Intercept(Lock) fault.
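The DCR.lc behavior described above can be sketched as a small decision function. The function name and the returned strings are illustrative only. The third outcome (an external bus lock when DCR.lc is 0) is inferred from the statement that setting DCR.lc to 1 disables such transactions.

```python
def ia32_atomic_reference_outcome(dcr_lc, serviced_in_cache):
    """Model the outcome of an IA-32 atomic (locked) memory reference."""
    if serviced_in_cache:
        # Handled inside the processor's caches; no bus lock is needed.
        return "completes internally"
    if dcr_lc == 1:
        # External bus locks are disabled, so the reference is intercepted.
        return "IA_32_Intercept(Lock) fault"
    # Inferred: with DCR.lc == 0 the processor may use an external bus lock.
    return "external bus lock transaction"


outcome = ia32_atomic_reference_outcome(dcr_lc=1, serviced_in_cache=False)
```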
Page 527
Processor Abstraction Layer This chapter defines the architectural requirements for the Processor Abstraction Layer (PAL) for all processors based on the Itanium architecture. It is intended for processor designers, firmware/BIOS designers, system designers, and writers of diagnostic and low level operating system software. PAL is part of the Itanium processor architecture and its goal is to provide a consistent firmware interface to abstract processor implementation-specific features.
Page 528
Figure 11-1. Firmware Model (figure labels: Operating System Software; UEFI runtime services; Power mgmt, hot-plug, etc.; OS Boot Handoff; Transfers to OS entrypoints; Instruction Execution; Unified Extensible Firmware Interface (UEFI) procedure calls; OS Boot Selection; Interrupts, traps, and faults; Advanced Configuration and Power Interface; System Abstraction Layer (SAL))
Page 529
PAL encapsulates those processor functions that are likely to change from implementation to implementation so that SAL firmware and operating system software can maintain a consistent view of the processor. These include non-performance-critical functions such as processor initialization, configuration and error handling.
Page 530
11.1.3 PAL Entrypoints The following hardware events can trigger the execution of a PAL entrypoint: • Power-on/reset • Hardware errors (both correctable and uncorrectable) • Initialization event (via external interrupt bus message or processor pin) • Platform management interrupt (via external interrupt bus message or processor pin) These hardware events trigger the execution of one of the following PAL entrypoints (as shown in...
Page 531
11.1.5 OS Entrypoints There are several entrypoints from SAL into an operating system (or equivalent software). Entrypoints from SAL into the operating system are expected to meet the following model: • OS_BOOT – Operating System Boot interface. • OS_MCA – Operating System Machine Check Abort Handler. ...
Page 534
• The 8 bytes at 0xFFFF_FFE0 (4GB-32) contain the physical address of the Firmware Interface Table. • The 16 bytes at 0xFFFF_FFD0 (4GB-48) contain the FIT entry for the PAL_A (or generic PAL_A in the split PAL_A model) code provided by the processor vendor. The format of this FIT entry is described in Figure 11-6.
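The two addresses above are given both as hexadecimal values and as offsets below the 4 GB boundary. A quick check of that arithmetic, using illustrative constant names that do not come from the manual:

```python
FOUR_GB = 1 << 32

# Locations named in the text, expressed as offsets below the 4 GB boundary.
FIT_POINTER_ADDR = FOUR_GB - 32      # 8-byte physical address of the FIT
PAL_A_FIT_ENTRY_ADDR = FOUR_GB - 48  # 16-byte FIT entry for PAL_A


def fmt(addr):
    """Format a 32-bit address in the manual's 0xHHHH_HHHH style."""
    return f"0x{addr >> 16:04X}_{addr & 0xFFFF:04X}"


fit_pointer_text = fmt(FIT_POINTER_ADDR)
pal_a_entry_text = fmt(PAL_A_FIT_ENTRY_ADDR)
```

Note that the 16-byte PAL_A entry ends exactly where the 8-byte FIT pointer begins.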
Page 535
At a minimum, all of the PAL firmware components, pointers at the top of the firmware address space, FIT tables and the portion of the SAL code that is executed at the RECOVERY CHECK hand-off must be accessible from the processor without any special system fabric initialization sequence.
Page 536
Figure 11-6. Firmware Interface Table Entry
Start + 8 to Start + 16: Chksum (bits 63:56), Type (bits 55:48), Version (bits 47:32, 2 bytes), Reserved (bits 31:24), Size (bits 23:0, 3 bytes)
Start of entry to Start + 8: Address (8 bytes)
• Size – A 3-byte field containing the size of the component in bytes divided by 16. ...
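A minimal sketch of decoding one 16-byte FIT entry, assuming the field positions read from Figure 11-6 (Address in the first 8 bytes; Size in bits 23:0, Version in bits 47:32, Type in bits 55:48 and Chksum in bits 63:56 of the second 8 bytes) and assuming little-endian byte order. The class and function names are hypothetical, and any sub-structure of the Type byte is ignored here.

```python
import struct
from typing import NamedTuple


class FitEntry(NamedTuple):
    address: int
    size_bytes: int
    version: int
    type: int
    checksum: int


def parse_fit_entry(raw: bytes) -> FitEntry:
    """Decode one FIT entry; layout and byte order are assumptions."""
    if len(raw) != 16:
        raise ValueError("a FIT entry is 16 bytes")
    address, upper = struct.unpack("<QQ", raw)
    size_units = upper & 0xFFFFFF          # bits 23:0, in units of 16 bytes
    version = (upper >> 32) & 0xFFFF       # bits 47:32
    entry_type = (upper >> 48) & 0xFF      # bits 55:48
    checksum = (upper >> 56) & 0xFF        # bits 63:56
    return FitEntry(address, size_units * 16, version, entry_type, checksum)


# Made-up sample entry: size field 0x400 means 0x400 * 16 bytes.
sample_upper = (0xAB << 56) | (0x0F << 48) | (0x0102 << 32) | 0x000400
sample = parse_fit_entry(struct.pack("<QQ", 0xFFFF0000, sample_upper))
```

The Size field is stored divided by 16, so the sketch multiplies it back out to report a byte count.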
Page 537
11.2 PAL Power On/Reset 11.2.1 PALE_RESET The purpose of PALE_RESET is to initialize and test the processor. Upon receipt of a power-on/reset event the processor begins executing code from the PALE_RESET entrypoint in the firmware address space. PALE_RESET initializes the processor and may perform a minimal processor self test.
Page 538
• GR34 contains the physical address for making a PAL procedure call. If the call is for RECOVERY CHECK, only the subset of PAL procedures needed for SALE_ENTRY to perform firmware recovery will be available. These procedures are: • PAL_FREQ_RATIOS •...
Page 539
• PSR: PSR.bn is 1; PSR.dfl and PSR.dfh are 1 if the floating-point unit failed self test. All other PSR bits are 0. PSR.ic and PSR.i are zero to ensure external interrupts, NMI and PMI interrupts are disabled. • CRs: The contents of all control registers are undefined except the following: •...
Page 540
• status – A function-dependent 8-bit field indicating the firmware status on entry to SALE_ENTRY. If the function value is RESET or RECOVERY_CHECK, the status values are: Table 11-4. status Field Values Status Value Description Normal Normal reset. FIT Header Failure FIT header for FIT and alternate FIT (if supported) is incorrect FIT Checksum Failure...
Page 541
Table 11-4. status Field Values (Continued) Status Value Description PAL_B Auth Failure / Good PAL_B found One or more compatible PAL_B's failed authentication and checksum. Another compatible PAL_B was found that passed authentication and checksum. 64K Unaligned No PAL_B was found in the FIT and alternate FIT (if supported) that was correctly aligned to a 64KB boundary.
Page 542
• state – A 2-bit field indicating the state of the processor after self-test. If SAL directed PAL to skip some self-tests by modifying the self-test control word, failures related to these self-tests will not be reflected in this state. Table 11-6.
Page 543
• test_status – An unsigned 32-bit field providing additional information on test failures when the state field returns a value of PERFORMANCE RESTRICTED or FUNCTIONALLY RESTRICTED. The value returned is implementation dependent. 11.2.3 PAL Self-test Control Word The PAL self-test control word is a 48-bit value. This bit field is defined in Figure 11-10.
Page 544
11.3 Machine Checks 11.3.1 PALE_CHECK When a machine check abort (MCA) occurs, PALE_CHECK is responsible for saving minimal processor state to an uncacheable platform-specific memory location previously registered with PAL via the PAL_MC_REGISTER_MEM procedure. This platform location is called the Minimal State Save Area (min-state save area) and is described in Section 11.3.2.4, “Processor Min-state Save Area Layout”...
Page 545
For testing and configuration purposes, it may be necessary for software to intentionally generate a machine check. In this case PALE_CHECK will log the error information, but not attempt recovery before branching to SALE_ENTRY. To allow for this, the PAL_MC_EXPECTED procedure call is defined to indicate that PALE_CHECK should not attempt recovery.
Page 546
• GR16 through GR20 (bank 0) contain parameters which PALE_CHECK passes to SALE_ENTRY for diagnostic and recovery purposes: • GR16 contains the address to the first available location in the min-state save area for use by SAL. The address is 8-byte aligned. •...
Page 547
• Cache: The processor internal cache is enabled and is unchanged from the time of the MCA except for any lines that were invalidated to correct the error. • TLB: The TCs may be initialized and the TRs are unchanged from the time of the MCA.
Page 548
Table 11-7. Processor State Parameter Fields (Continued) Field Bits Description Trap lost. A value of 1 indicates the machine check occurred after an instruction was executed but before a trap that resulted from the instruction execution could be generated. More information. A value of 1 indicates that more error information about the machine check event is available by making the PAL_MC_ERROR_INFO procedure call.
Page 549
11.3.2.1.1 Using Processor State Parameter to Determine if Software Recovery of a Machine Check is Possible The us, ci, co, and sy bits in the Processor State Parameter are valid only if the error has not been previously corrected in hardware or firmware (cm bit is 0). Even then, only the bit combinations shown in Table 11-8 are valid.
Page 550
After return from the SAL rendezvous call, PALE_CHECK will complete processing the machine check if the rendezvous was successful and then branch to SALE_ENTRY with GR19 set to zero. The processor state when transferring to SAL is as defined in Section 11.3.2, “PALE_CHECK Exit State”...
Page 551
area is architectural state needed by the PAL code to resume during MCA and INIT events (architected min-state save area + reserved). The remaining space in the buffer is a scratch space reserved exclusively for PAL use; therefore, SAL and OS must not use this area.
Page 552
Figure 11-2. Processor State Saved in Min-state Save Area (partial; offset and contents)
0xf8 Bank 0 GR31
0xf0 Bank 0 GR30
0xe8 Bank 0 GR29
0xe0 Bank 0 GR28
0xd8 Bank 0 GR27
0xd0 Bank 0 GR26
0xc8 Bank 0 GR25
0xc0 Bank 0 GR24
Additional figure labels: GR16; 0x1c8; 0x1c0 XFS or undefined...
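The Bank 0 GR24 to GR31 offsets that survive from Figure 11-2 (0xc0 through 0xf8) are evenly spaced, 8 bytes apart. A small sketch that reproduces them; the helper name is hypothetical and it covers only those eight registers, since the rest of the figure is not recoverable from this excerpt.

```python
BANK0_GR24_OFFSET = 0xC0  # offset of Bank 0 GR24, as listed in Figure 11-2
SLOT_SIZE = 8             # each saved register occupies 8 bytes


def bank0_gr_offset(gr):
    """Min-state save area offset of Bank 0 GR24..GR31 (per Figure 11-2)."""
    if not 24 <= gr <= 31:
        raise ValueError("only Bank 0 GR24-GR31 offsets are listed here")
    return BANK0_GR24_OFFSET + SLOT_SIZE * (gr - 24)


offsets = {gr: hex(bank0_gr_offset(gr)) for gr in range(24, 32)}
```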
Page 553
The NaT bits stored in the first entry of the min-state save area have the following layout. Figure 11-3. NaT Bits for Saved GRs. Bits 31 and below: NaT bits for Bank 0 GR16 to GR31, then NaT bits for GR15 to GR1. Bits 63 to 32: ...
Page 554
There are certain error cases that may require returning to a new context in order to recover from the machine check. If this occurs, a new context can be returned to via the PAL_MC_RESUME procedure with the new_context flag set. The caller needs to set up the new processor min-state save area as shown in Figure 11-2 for all the listed...
Page 555
• If recovery is not supported when PSR.ic=0 then GR24 - GR31 (bank 0) are undefined and their contents have been lost. In this case, recovery is not possible. See Section 11.3.1.1, “Resources Required for Machine Check and Initialization Event Recovery” for details. •...
Page 556
• DBR/IBRs: The contents of all breakpoint registers are unchanged from the time of the INIT. • PMCs/PMDs: The contents of the PMC registers are unchanged from the time of the INIT. The contents of the PMD registers are not modified by PAL code, but may be modified if events it is monitoring are encountered.
Page 557
Table 11-12. Processor State Parameter Fields (Continued) INIT Field Bits Description value Uncontained storage damage. A value of 1 indicates the error is contained within the CPU and memory hierarchy, but that some memory locations may be corrupt. If us is set to 1, then co and sy will always be cleared to 0.
Page 558
Table 11-12. Processor State Parameter Fields (Continued) INIT Field Bits Description value Register file check. A value of 1 indicates that a register file related machine check occurred. See the PAL_MC_ERROR_INFO procedure call for more information. Uarch check. A value of 1 indicates that a micro-architectural related machine check occurred.
Page 559
to register its PALE_PMI entrypoint, processor operation is undefined. If a SAL related PMI is seen before the SAL PMI handler is registered, the PAL PMI code will just return to the interrupted context. Figure 11-7. PMI Entrypoints: PALE_PMI, SALE_PMI. The hardware events that can cause the PMI request are referred to as PMI events.
Page 560
Table 11-15. PMI Message Vector Assignments Priority Vector Description PAL Reserved High IA-32 Machine Check Rendezvous PAL Reserved 11.5.2 PALE_PMI Exit State The state of the processor on exiting PALE_PMI is: • GRs: The contents of non-banked general registers are unchanged from the time of the interruption.
Page 561
• BR0 PAL PMI return address. • ARs: The contents of all application registers are unchanged from the time of the interruption, except the RSE control register (RSC) and the ITC and RUC counters. The RSC.mode field will be set to 0 (enforced lazy mode) while the other fields in the RSC are unchanged.
Page 562
Figure 11-8 shows state transitions for the various power states and the software interfaces required for the transitions. Figure 11-8. Power States NORMAL/ LOW-POWER PAL_HALT_LIGHT PAL_HALT Unmasked external Unmasked external interrupts, Machine Interrupts, Machine check, Reset, PMI check, Reset, PMI and INIT and INIT LIGHT HALT...
Page 563
implement. It is the responsibility of the caller to ensure cache coherency in this state. • HALT 2 - 7 – These are optional implementation-dependent states entered by calling PAL_HALT with a power state argument in the range of 2-7. Before making this procedure call, the operating system software should first ascertain that the states are implemented by calling PAL_HALT_INFO.
Page 564
Figure 11-9. Power and Performance Characteristics for P-states (figure labels: Power; Performance)
P-states can be utilized by software to implement a demand-based dynamic power management policy that continuously tries to adapt the processor performance to the current workload characteristics. This allows software to achieve power savings at the system level, while allowing it to quickly respond to changing workload requirements.
Page 565
Figure 11-10. Example of a P-state Transition Policy (figure labels: Halt; High Utilization; Transitions initiated by software; Utilization)
11.6.1.1 Power Dependency Domains
The concept of P-states applies to each logical processor, and this gives software the required granularity to individually control the power/performance characteristics for each available thread of execution in the system.
Page 566
parameters. Each P-state maps to a set of values for the domain parameters, and hence a P-state transition results in a change in the underlying power/performance characteristics for the logical processor. The Itanium architecture supports different types of dependency domains, which enables software to have different degrees of control for P-state changes affecting logical processors in the domain.
Page 567
A hardware-independent dependency domain (HIDD) is a self-contained domain, which typically means that each logical processor is the only logical processor in its domain and that its domain parameters are individually controllable. Since there are no dependencies with any other logical processors, there is no P-state coordination needed for such domains.
Page 568
procedure, and the caller is expected to make another PAL_SET_PSTATE request to transition to the desired P-state. The transition_latency_2 field in the pstate_buffer returned by PAL_PSTATE_INFO indicates the time interval the caller needs to wait to have a reasonable chance of success when initiating another PAL_SET_PSTATE call. Implementation-specific event conditions may prevent a PAL_SET_PSTATE request from being accepted (e.g., due to a thermal protection mechanism), in which case the PAL procedure returns a status of transition failure.
Page 569
initiates a new performance_index count, which is reported when the next PAL_GET_PSTATE procedure call is made. A call to PAL_GET_PSTATE with a type operand of 1 resets the performance measurement logic. SCDD: If the logical processor belongs to a software-coordinated dependency domain, the performance index returned (for either type=0 or 3) corresponds to the target P-state requested by the most recent successful PAL_SET_PSTATE procedure call.
Page 570
As seen above, for a HCDD, the PAL_GET_PSTATE procedure allows the caller to get feedback on the dynamic performance of the processor over a software-controlled time period. The caller can use this information to get better system utilization over a subsequent time period by changing the P-state in correlation with the current workload demand.
Page 571
For example, let's say the minimum frequency of P0 is 1 GHz and the maximum frequency of P0 is 1.5 GHz. If we are at 1 GHz for a time period of 4, 1.25 GHz for a time period of 16 and 1.5 GHz for a time period of 20, the average performance index is ((100 * 4) + (125 * 16) + (150 * 20)) / (4 + 16 + 20) = 135. The performance_index equation for other P-states can be calculated in a similar manner using their respective frequency index values.
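The worked example is a time-weighted average, which can be checked with a short sketch. The function name is illustrative. The inputs are the frequency index values (100, 125, 150) and the time periods (4, 16, 20) stated in the example.

```python
def average_performance_index(samples):
    """Time-weighted average of (frequency_index, time_period) pairs."""
    total_time = sum(period for _, period in samples)
    if total_time == 0:
        raise ValueError("total time period must be non-zero")
    weighted = sum(index * period for index, period in samples)
    return weighted / total_time


# 1 GHz for 4 periods, 1.25 GHz for 16 periods, 1.5 GHz for 20 periods.
example = average_performance_index([(100, 4), (125, 16), (150, 20)])
```

The numerator is 400 + 2000 + 3000 = 5400 and the total time is 40, giving 135.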
Page 572
Figure 11-12. Interaction of P-states with HALT State Performance (P0) (P1) (P2) Enter HALT State Exit HALT State (P3) Time (Previous) GET SET(P3) (Current) GET As shown above, the value returned for performance_index does not account for the performance during the time spent by the logical processor in the HALT state. This provides for better accuracy in the value reported for performance_index, allowing the caller to make optimal adjustments to the system utilization even in scenarios where we have interactions between P-states and HALT state.
Page 573
The VMM is responsible for managing the set of available system resources (CPU, memory, peripherals) and for implementing policies to virtualize these resources. In order to support virtual processor operations, the VMM will create a virtual environment and associate logical processors with the virtual environment. A virtual environment consists of one or more logical processors plus the memory resource allocated by the VMM during PAL_VP_INIT_ENV.
Page 574
Table 11-16. Virtual Processor Descriptor (VPD) Name Entries Offset Description Class Virtualization Acceleration Control – these control bits enable virtualization acceleration of a particular resource or instruction. See Section 11.7.1.1, “Virtualization Controls” on page 2:329 for details. (Class: Control [always]) Virtualization Disable Control –...
Page 575
Table 11-16. Virtual Processor Descriptor (VPD) (Continued)
Name Entries Offset Description Class
Reserved 1336 Reserved Area – Reserved for future expansion. (Class: Reserved)
vpsr 1424 Virtual Processor Status Register – Represents the Processor Status Register of the virtual processor. (Class: Architectural State, Table 11-17)
Page 576
Table 11-17. Virtual Processor Descriptor (VPD) – VPSR Field Bits Class User Mask = PSR{5:0} Reserved No accelerations require these fields. System Mask = PSR{23:0} Always a_int, a_from_psr a_from_psr 12:6, 16 Reserved a_from_psr Always PSR.l = PSR{31:0} a_from_psr 31:28 Reserved PSR{63:0} 33:32 No accelerations require these fields.
Page 577
Table 11-18. Virtual Processor Descriptor (VPD) – VCR[0-127] Register Name Class VCR0-15 No accelerations require these virtual control registers. VCR16 VIPSR a_from_int_cr, a_to_int_cr VCR17 VISR VCR18 No accelerations require this virtual control register. VCR19 VIIP a_from_int_cr, a_to_int_cr VCR20 VIFA Always VCR21 VITIR Always...
Page 578
Table 11-19. Virtualization Acceleration Control (vac) Fields (Continued) Field Description a_to_int_cr Enable the interruption control register (CR16-27) write optimization. See Section 11.7.4.2.3, “Interruption Control Register Write Optimization” on page 2:341 for details. a_from_psr Enable the processor status register read optimization. See Section 11.7.4.2.4, “MOV-from-PSR Optimization”...
Page 579
Table 11-20. Virtualization Disable Control (vdc) Fields (Continued) Field Bits Description d_to_pmd Disable PMD write virtualization – If 1, writes to the performance monitor data registers (PMDs) are not virtualized. Code running with PSR.vm==1 can write the performance monitor data registers of the logical processor directly and without handing off to the VMM.
Page 580
interruptions except the Virtualization vector. Virtualization vector will be delivered as virtualization intercept in the per-virtual-processor host IVT. See Section 11.7.3, “PAL Intercepts in Virtual Environment” on page 2:332 for details on PAL intercepts. In the virtual environment, the IVA (CR2) control register will be set by PAL virtualization-related procedures and services as summarized in Table 11-21.
Page 581
Section 11.7.3.1, “PAL Virtualization Intercept Handoff State” on page 2:333 describes the handoff state of the PAL intercepts. For all interruption vectors other than Virtualization vector, the architectural state at the corresponding IVA-based interruption vector is the same as defined in Chapter 8, “Interruption Vector Descriptions”...
Page 582
• IRRs: The contents of IRRs are not changed by PAL. Incoming interruptions may change the contents. • IFS: IFS is unchanged from the time of the interruption. • IIP: Contains the value of IP at the time of the interruption. •...
Page 583
Table 11-22. PAL Virtualization Intercept Handoff Cause (GR24) (Continued) Value Cause Description ptc_g Due to ptc.g instruction. ptc_ga Due to ptc.ga instruction. ptr_d Due to ptr.d instruction. ptr_i Due to ptr.i instruction. thash Due to thash instruction. ttag Due to ttag instruction. Due to tpa instruction.
Page 584
resource and perform the virtualized operations based on the virtual instance of the resource without handing off to the VMM. Section 11.7.4.2, “Virtualization Accelerations” on page 2:337 describes the supported Virtualization accelerations in the architecture. • Virtualization disables – Virtualization disables optimize the execution of virtualized instructions by disabling virtualization of a particular resource or instruction.
Page 585
11.7.4.1.2 Virtualization Cause Optimization Virtualization cause optimization is enabled by the cause bit in the config_options parameter of PAL_VP_INIT_ENV. When enabled, the causes of virtualization intercepts will be provided to the VMM during PAL intercept handoffs within the virtual environment. When disabled, no cause information will be provided during PAL intercept handoffs.
Page 587
When this optimization is enabled, execution of rsm and ssm instructions, with PSR.vm==1 and system mask equal to zero (0x0), will not intercept to the VMM unless a fault condition is detected (see Table 11-29 for details). A virtual external interrupt is raised if the virtual highest priority pending interrupt (vhpi) is unmasked by the new vpsr.i and vtpr.
Page 588
Table 11-29. Interruptions when Virtual External Interrupt Optimization is Enabled Instructions Interruptions When the virtual external interrupt optimization is enabled, execution rsm, ssm of rsm and ssm instructions with PSR.vm==1 which modify only vpsr.i, may raise the following faults: • Privileged Operation fault – if vpsr.cpl is not zero MOV-from-TPR When the virtual external interrupt optimization is enabled, execution of MOV-from-CR instruction targeting vtpr with PSR.vm==1, may...
Page 589
Table 11-31. Interruptions when Interruption Control Register Read Optimization is Enabled Instructions Interruptions Move from interruption control registers When the interruption control register read optimization is enabled, reads of interruption control registers with PSR.vm==1, may raise the following faults: • Illegal Operation fault – if vpsr.ic is not zero or the target operand specifies GR 0 or an out-of-frame stacked register •...
Page 590
the virtual processor status register without any intercepts to the VMM; and the last value written to the vpsr will be returned, unless a fault condition is detected (see Table 11-35 for details). The values returned for the mfl, mfh, ac, up and be bits are simply the values of those bits in the PSR of the logical processor, since those bits are not virtualized.
Page 591
Table 11-36. Synchronization Requirements for MOV-from-CPUID Optimization VPD Resource Synchronization Required vcpuid0-4 Write Table 11-37. Interruptions when MOV-from-CPUID Optimization is Enabled Instructions Interruptions MOV-from-CPUID When the MOV-from-CPUID optimization is enabled, MOV-from-CPUID instructions with PSR.vm==1, may raise the following faults: •...
Page 592
corresponding NaT bits from the VPD. vpsr.bn is updated to reflect the new register bank without any intercepts to the VMM, unless a fault condition is detected (see Table 11-46 for details). If this optimization is disabled, execution of the bsw instruction with PSR.vm==1 results in a virtualization intercept.
Page 593
There is no synchronization requirement for the virtualization of probe instructions. 11.7.4.2.9 Test Feature Optimization The test feature optimization is enabled by the a_tf bit in the Virtualization Acceleration Control (vac) field in the VPD. When this optimization is enabled, test feature (tf) instructions running with PSR.vm==1 will test the VCPUID[4] register in the VPD.
Page 594
When this optimization is enabled, execution of rsm and ssm instructions, with PSR.vm==1 and system mask equal to zero (0x0), will not intercept to the VMM unless a fault condition is detected (see Table 11-45 for details). When PSR.vm==1, execution of rsm and ssm instructions which modify any bits other than vpsr.ic and user mask fields will result in virtualization intercepts independent of whether this optimization is enabled or not.
Page 595
Table 11-46. Virtualization Disables Summary (Continued) Virtualization Disable Disable Control Description (vdc) Disable ITM Virtualization d_itm Section 11.7.4.3.6 Disable PSR Interrupt-bit Virtualization d_psr_i Section 11.7.4.3.7 a. The Virtualization Disable Control (vdc) field resides in the Virtual Processor Descriptor (VPD), see Section 11.7.1, “Virtual Processor Descriptor (VPD)”...
Page 596
11.7.4.3.4 Disable PMC Virtualization The PMC virtualization disable is controlled by the d_pmc bit in the Virtualization Disable Control (vdc) field in the VPD. When this control is set to 1, accesses (reads/writes) to the performance monitor configuration registers (PMCs) are not virtualized, and code running with PSR.vm==1 can read and write these resources directly without any intercepts to the VMM.
Page 597
11.7.4.4 Virtualization Optimization Combinations Table 11-47 describes the supported combinations of virtualization accelerations and disables.
Table 11-47. Supported Virtualization Optimization Combinations
Disables: d_vmsw, d_extint, d_ibr_dbr, d_pmc, d_to_pmd, d_itm, d_psr_i
Accelerations: a_int, a_from_int_cr, a_to_int_cr, a_from_psr, a_from_cpuid, a_cover, a_bsw, a_all_probes, a_select_probes, a_tf, a_ic_um
a. “o” indicates the corresponding virtualization acceleration and disable can be enabled together. b.
Page 598
1. Read synchronization – When a specific acceleration is enabled, after interruptions and intercepts that occur when PSR.vm was 1, the VMM must invoke PAL_VPS_SYNC_READ to synchronize the related resources before reading their values from the VPD. 2. Write synchronization – When a specific acceleration is enabled, the VMM must invoke PAL_VPS_SYNC_WRITE to synchronize the related resources after modifying their values in the VPD and before resuming the virtual processor.
Page 599
Machine Check (MC) A machine check is a hardware event that indicates that a hardware error or architectural violation has occurred that threatens to damage the architectural state of the machine, possibly causing data corruption. The occurrence of the error triggers the execution of firmware code which records information about the error, and attempts to recover when possible.
Page 600
Scratch When applied to either an entrypoint or procedure, scratch means that the contents of the register have no meaning and need not be preserved. Further, the register is available for the storage of local variables. Unless otherwise noted, the register should not be relied upon to contain any particular value after exit.
Page 601
• During the execution of PAL procedures to the memory buffer allocated by the caller of the procedure using the memory attribute of the address passed by the caller. • PAL may also issue loads from the architected firmware address space and loads/stores from the registered min-state save area whenever it is executing a PAL procedure or handling PAL-based interruptions (reset, MCA, INIT and PMI).
Page 602
Table 11-48. PAL Procedure Index Assignment
Index        Description
0            Reserved
1 - 255      Architected procedures; static register calling conventions
256 - 511    Architected procedures; stacked register calling conventions
512 - 767    Implementation-specific procedures; static register calling conventions
768 - 1023   Implementation-specific procedures; stacked register calling conventions
1024 +       Reserved
The assignment of indices for all architected procedures is controlled by this document.
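The index ranges in Table 11-48 can be expressed as a lookup. This is a sketch with a hypothetical function name. Treating index 0 as reserved is an assumption, since the index for the table's first "Reserved" row did not survive in this excerpt.

```python
def classify_pal_index(index):
    """Map a PAL procedure index to (class, calling convention)."""
    if index < 0:
        raise ValueError("PAL procedure indices are non-negative")
    if index == 0 or index >= 1024:
        # Index 0 is assumed reserved; 1024 and above are listed as reserved.
        return ("reserved", None)
    if index <= 255:
        return ("architected", "static")
    if index <= 511:
        return ("architected", "stacked")
    if index <= 767:
        return ("implementation-specific", "static")
    return ("implementation-specific", "stacked")


brand_info = classify_pal_index(274)  # PAL_BRAND_INFO
cache_flush = classify_pal_index(1)   # PAL_CACHE_FLUSH
```

The results agree with the procedure descriptions elsewhere in this chapter: PAL_BRAND_INFO (274) uses stacked registers, while PAL_CACHE_FLUSH (1) and PAL_BUS_GET_FEATURES (9) use static registers.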
Page 603
Table 11-49. PAL Cache and Memory Procedures (Continued)
Procedure Class Conv. Mode Buffer Description
PAL_CACHE_PROT_INFO Req. Static Both Return instruction or data cache protection information.
PAL_CACHE_SHARED_INFO Opt. Static Both Returns information on which logical processors share caches.
PAL_CACHE_SUMMARY Req. Static Both Return a summary of the cache hierarchy.
Page 604
Table 11-50. PAL Processor Identification, Features, and Configuration Procedures
Procedure Class Conv. Mode Buffer Description
PAL_PROC_SET_FEATURES Req. Static Phys. Enable or disable configurable processor features.
PAL_REGISTER_INFO Req. Static Both Return AR and CR register information.
PAL_RSE_INFO Req. Static Both Return RSE information.
PAL_SET_HW_POLICY Opt.
Page 605
a. Calling this procedure may affect resources on multiple processors. Please refer to implementation-specific reference manuals for details.
Table 11-53. PAL Processor Self Test Procedures
Procedure Class Conv. Mode Buffer Description
PAL_CACHE_LINE_INIT Req. Static Phys. Initialize tags and data of a cache line for processor testing.
Page 606
Table 11-55. PAL Virtualization Support Procedures (Continued)
Procedure Class Conv. Mode Buffer Description
PAL_VP_SAVE 271 Opt. Stacked Virt. Dep. Save virtual processor state on the logical processor.
PAL_VP_TERMINATE 272 Opt. Stacked Virt. Dep. Terminates operation for the specified virtual processor.
11.10.2 PAL Calling Conventions
The following general rules govern the definition of the PAL procedure calling conventions.
Page 607
11.10.2.1.3 Making PAL Procedure Calls in Physical or Virtual Mode PAL procedure calls which are made in physical mode must obey the calling conventions described in this chapter, but there are no additional restrictions beyond those noted above. PAL procedure calls made in virtual mode must have the region occupied by PAL_PROC virtually mapped with an ITR.
Page 608
Table 11-56. State Requirements for PSR (Continued) PSR Bit Description Entry Exit Class protection key validation enable unchanged data address translation enable unchanged preserved disabled FP register f2 to f31 unchanged disabled FP register f32 to f127 unchanged unchanged secure performance monitors unchanged privileged performance monitor enable unchanged...
Page 609
Table 11-57. Definition of Terms Term Description Must be zero at entry to the procedure or on exit from the procedure. If the value at entry is not zero, the procedure may return an illegal argument or execute in an undefined manner. Must be one at entry to the procedure or on exit from the procedure.
Page 610
Table 11-58. System Register Conventions (Continued) Name Description Class CMCV Corrected Machine Check Vector unchanged LRR0-LRR1 Local Redirection Registers 0-1 unchanged Region Registers preserved Protection Key Registers preserved Translation Registers unchanged Translation Cache scratch IBR/DBR Break Point Registers preserved Performance Monitor Control Registers preserved Performance Monitor Data Registers unchanged...
Page 611
Table 11-60. General Registers – Stacked Calling Conventions (Continued)
Register       Conventions
GR8 - GR11     scratch, procedure return value
GR12           special, stack pointer (sp)
GR13           special, thread pointer (tp)
GR14 - GR27    scratch
GR28           input argument, scratch (PAL Index must be passed in GR28)
GR29 - GR31    scratch
Bank 0 Registers...
Page 613
11.10.3 PAL Procedure Specifications The following pages provide detailed interface specifications for each of the PAL procedures defined in this document. Included in the specification are the input parameters, the output parameters, and any required behavior.
Page 614
PAL_BRAND_INFO – Provides Processor Branding Information (274)
Purpose: Provides processor branding information.
Calling Conv: Stacked Registers
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
Argument Description
index Index of PAL_BRAND_INFO within the list of PAL procedures.
info_request Unsigned 64-bit integer specifying the information that is being requested. (See Table 11-62)
address...
Page 615
PAL_BUS_GET_FEATURES – Get Processor Bus Dependent Configuration Features (9)
Purpose: Provides information about configurable processor bus features.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
Argument Description
index Index of PAL_BUS_GET_FEATURES within the list of PAL procedures.
Reserved Reserved Reserved...
Page 616
PAL_BUS_GET_FEATURES Table 11-63. Processor Bus Features Bits Class Control Description Opt. Req. Disable Bus Data Error Checking. When 0, bus data errors are detected and single bit errors are corrected. When 1, no error detection or correction is done. Opt. Req.
Page 617
PAL_BUS_SET_FEATURES – Set Processor Bus Dependent Configuration Features (10)
Purpose: Enables/disables specific processor bus features.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_BUS_SET_FEATURES within the list of PAL procedures.
feature_select: 64-bit vector denoting desired state of each feature (1=select, 0=non-select)....
Page 618
PAL_CACHE_FLUSH – Flush Data or Instruction Caches (1)
Purpose: Flushes the processor instruction or data caches.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_CACHE_FLUSH within the list of PAL procedures.
cache_type: Unsigned 64-bit integer indicating which cache to flush....
Page 619
PAL_CACHE_FLUSH throughout the coherence domain. The procedure will perform the necessary serialization and synchronization as required by the architecture. This call does not ensure that data in the processor's coalescing buffers is flushed to memory. See Section 4.4.5, “Coalescing Attribute” on page 2:78 for how to flush the coalescing buffers....
Page 620
PAL_CACHE_FLUSH
Table 11-66. Cache Line State when inv = 1
Old State | New State | Comments
Invalid | Invalid |
Clean | Invalid |
Modified | Invalid | Modified data is copied back to memory.
The progress_indicator is an unsigned 64-bit integer specifying the starting position of the flush operation.
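The transitions in Table 11-66 can be summarized in a few lines of code. The following is a minimal sketch only; the names LineState and flush_with_invalidate are hypothetical and are not part of the PAL interface.

```python
from enum import Enum


class LineState(Enum):
    """Cache line states named in Table 11-66 (enum itself is illustrative)."""
    INVALID = "Invalid"
    CLEAN = "Clean"
    MODIFIED = "Modified"


def flush_with_invalidate(old_state):
    """Model one cache line being flushed with inv = 1.

    Returns (new_state, copied_back). Every line ends up Invalid;
    only a Modified line has its data copied back to memory first.
    """
    copied_back = old_state is LineState.MODIFIED
    return LineState.INVALID, copied_back
```

For example, flush_with_invalidate(LineState.CLEAN) yields an Invalid line with no copy-back, while a Modified line is written back before it is invalidated.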
Page 621
PAL_CACHE_FLUSH calling this routine. Alternatively, software can disable the TLBs by setting PSR.it, PSR.dt, and PSR.rt to 0. • The specified caches may also contain PAL firmware code cache entries required to flush the cache. • The specified caches may contain PAL and SAL PMI code if this call was made with PSR.ic = 1 and a PMI interrupt is seen during the execution of the call.
Page 622
PAL_CACHE_INFO – Get Detailed Cache Information (2)
Purpose: Returns information about a particular processor instruction or data cache at a specified level in the cache hierarchy.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index...
Page 623
PAL_CACHE_INFO cache if the cache contents never get flushed to memory (for example an instruction cache).
• stride – Unsigned 8-bit integer denoting the binary log of the most effective stride in bytes for flushing the cache.
• store_latency – Unsigned 8-bit integer denoting the number of cycles after issue...
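Because stride is reported as a binary logarithm, the byte stride is obtained by shifting. The sketch below illustrates the arithmetic under stated assumptions: the helper names are hypothetical, and stepping through an address range at that stride is only one plausible use of the value, not a sequence prescribed by the text.

```python
def stride_bytes(stride_log2):
    """Convert the 8-bit binary-log stride value into a stride in bytes."""
    if not 0 <= stride_log2 <= 0xFF:
        raise ValueError("stride is an unsigned 8-bit integer")
    return 1 << stride_log2


def flush_addresses(base, length, stride_log2):
    """List the stride-aligned addresses covering [base, base + length).

    Illustrative only: one address per stride-sized block.
    """
    step = stride_bytes(stride_log2)
    start = base & ~(step - 1)  # round down to a stride boundary
    return list(range(start, base + length, step))
```

With a stride value of 6 (64 bytes), a 0x80-byte range starting at 0x1010 touches the blocks at 0x1000, 0x1040, and 0x1080.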
Page 624
PAL_CACHE_INIT – Initialize Caches (3)
Purpose: Initializes the processor controlled caches.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_CACHE_INIT within the list of PAL procedures.
level: Unsigned 64-bit integer containing the level of cache to initialize. If the cache level can be initialized independently, only that level will be initialized....
Page 625
PAL_CACHE_LINE_INIT – Initialize a Data Cache Line (31)
Purpose: Initializes the tags and data of a data or unified cache line of a processor controlled cache to known values without the availability of backing memory.
Calling Conv: Static
Mode: Physical
Buffer: Not dependent...
Page 626
PAL_CACHE_PROT_INFO – Get Detailed Cache Protection Information (38)
Purpose: Returns protection information about a particular processor instruction or data cache at a specified level in the cache hierarchy.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:...
Page 628
PAL_CACHE_READ – Read Values from the Processor Cache (259)
Purpose: Reads the data and tag of a processor-controlled cache line for diagnostic testing.
Calling Conv: Stacked Registers
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_CACHE_READ within the list of PAL procedures.
line_id: 8-byte formatted value describing where in the cache to read the data....
Page 629
PAL_CACHE_READ Table 11-74. part Input Values Value Description data data protection bits tag protection bits combined protection bits for data and tags a. Note that for this part no data is returned. Only protection bits are returned. All other values of part are reserved. The data return value contains the value read from the cache.
Page 630
PAL_CACHE_SHARED_INFO – Get Information on Caches Shared by Logical Processors (43)
Purpose: Returns information on caches shared between logical processors.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_CACHE_SHARED_INFO within the list of PAL procedures.
cache_level: Unsigned 64-bit integer specifying the level in the cache hierarchy for which information is requested....
Page 632
PAL_CACHE_SUMMARY – Get Cache Hierarchy Summary (4)
Purpose: Returns summary information about the hierarchy of caches controlled by the processor.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_CACHE_SUMMARY within the list of PAL procedures.
Reserved Reserved Reserved...
Page 633
PAL_CACHE_WRITE – Write Values into the Processor Cache (260)
Purpose: Writes the data and tag of a processor-controlled cache line for diagnostic testing.
Calling Conv: Stacked Registers
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_CACHE_WRITE within the list of PAL procedures.
line_id: 8-byte formatted value describing where in the cache to write the data....
Page 634
PAL_CACHE_WRITE Table 11-77. part Input Values Value Description data data protection tag protection combined data and tag protection All other values of part are reserved.
• mesi – Unsigned 8-bit integer denoting whether the line should be written as clean or dirty, shared or exclusive....
Page 635
PAL_CACHE_WRITE To guarantee correct behavior for this procedure, it is required that there shall be no RSE activity that may cause cache side effects.
Page 636
PAL_COPY_INFO – Return Parameters to Copy PAL Code to Memory (30)
Purpose: Returns the parameters needed to copy relocatable PAL code from the firmware address space to memory.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index...
Page 637
PAL_COPY_PAL – Copy PAL Code to Memory (256)
Purpose: Copy relocatable PAL code from the firmware address space to memory.
Calling Conv: Stacked Registers
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_COPY_PAL within the list of PAL procedures.
target_addr: Physical address of a memory buffer to copy relocatable PAL procedures and PAL PMI code....
Page 638
PAL_DEBUG_INFO – Get Debug Registers Information (11)
Purpose: Returns the number of instruction and data debug register pairs.
Calling Conv: Static Registers Only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_DEBUG_INFO within the list of PAL procedures.
Reserved Reserved Reserved...
Page 639
PAL_FIXED_ADDR – Get Fixed Geographical Address of Processor (12)
Purpose: Returns a unique geographical address of this processor.
Calling Conv: Static Registers Only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_FIXED_ADDR call within the list of PAL procedures.
Reserved Reserved Reserved...
Page 640
PAL_FREQ_BASE – Get Processor Base Frequency (13)
Purpose: Returns the frequency of the output clock for use by the platform, if that clock is generated by the processor.
Calling Conv: Static Registers Only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_FREQ_BASE within the list of PAL procedures....
Page 641
PAL_FREQ_RATIOS – Get Processor Frequency Ratios (14)
Purpose: Returns the ratios of the processor frequency, bus frequency, and interval timer to the input clock of the processor, if the platform clock is generated externally, or to the output clock to the platform, if the platform clock is generated by the processor.
Calling Conv: Static Registers Only
Mode: Physical or Virtual...
Page 642
PAL_GET_HW_POLICY – Retrieve Current Hardware Resource Sharing Policy (48)
Purpose: Returns the current hardware resource sharing policy of the processor.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_GET_HW_POLICY within the list of PAL procedures.
proc_num: Unsigned 64-bit integer that specifies for which logical processor information is being requested....
Page 643
PAL_GET_HW_POLICY
Table 11-80. Hardware policies returned in cur_policy
Value Name Description
Performance: The processor has its hardware resources configured to achieve maximum performance across all logical processors that share hardware with the logical processor on which the procedure call was made.
Fairness: The processor has its hardware resources configured to achieve approximately equal sharing of competing hardware resources among all the logical processors that share hardware...
Page 644
PAL_GET_PSTATE – Return Information on the Performance Index of the Processor (262)
Purpose: Returns the performance index of the processor.
Calling Conv: Stacked Registers
Mode: Physical and Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_GET_PSTATE within the list of PAL procedures.
type: Type of performance_index value to be returned by this procedure....
Page 645
PAL_GET_PSTATE Table 11-81. PAL_GET_PSTATE type Argument type Description The performance_index returned will correspond to the target P-state requested by software. • For SCDD (software-coordinated dependency domain) logical processors, this is the P-state requested by the most recent PAL_SET_PSTATE procedure call made by any logical processor in the domain.
Page 646
PAL_GET_PSTATE type=2, the procedure will return the performance_index value corresponding to the processor performance in the time duration between the previous call to PAL_GET_PSTATE with type=1 and the current call. If the processor had transitioned to a HALT state (see Section 11.6.1, “Power/Performance States (P-states)”...
Page 647
PAL_HALT – Halt Processor (28)
Purpose: Causes the processor to enter the HALT state, or one of the implementation-dependent low-power states.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_HALT within the list of PAL procedures.
halt_state: Unsigned 64-bit integer denoting low power state requested....
Page 648
PAL_HALT • I/O type is an unsigned 8-bit integer denoting the type of I/O transaction to complete. Table 11-83. I/O Type Definition Value Description No transaction Perform a load Perform a store All other values for I/O type are reserved. •...
Page 649
PAL_HALT_INFO – Get Halt State Information for Power Management (257)
Purpose: Returns information about the processor’s power management capabilities.
Calling Conv: Stacked Registers
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_HALT_INFO within the list of PAL procedures.
power_buffer: 64-bit pointer to a 64-byte buffer aligned on an 8-byte boundary....
Page 650
PAL_HALT_INFO The latency numbers given are the minimum number of processor cycles that will be required to transition the states. The maximum or average cannot be determined by PAL due to its dependency on outstanding bus transactions. For more information on power management, please refer to Section 11.6, “Power Management”...
Page 651
PAL_HALT_LIGHT – Cause Processor to Enter Coherent Halt State (29)
Purpose: Causes the processor to enter the LIGHT HALT state, where prefetching and execution are suspended, but cache and TLB coherency is maintained.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent...
Page 652
PAL_LOGICAL_TO_PHYSICAL – Get Information on Logical to Physical Processor Mappings (42)
Purpose: Returns information on the logical to physical processor mapping.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_LOGICAL_TO_PHYSICAL within the list of PAL procedures.
proc_number: Signed 64-bit integer that specifies for which logical processor information is being requested....
Page 655
PAL_MC_CLEAR_LOG – Clear Processor Error Logging Registers (21)
Purpose: Clears all processor error logging registers and resets the indicator that allows the error logging registers to be written. This procedure also checks the pending machine check bit and pending INIT bit and reports their states.
Calling Conv: Static Registers Only
Mode: Physical and Virtual...
Page 656
PAL_MC_DRAIN – Complete Outstanding Transactions (22)
Purpose: Ensures that all outstanding transactions in a processor are completed or that any MCA due to these outstanding transactions is taken.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:...
Page 657
PAL_MC_DYNAMIC_STATE – Returns Dynamic Processor State (24)
Purpose: Returns the Machine Check Dynamic Processor State.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_MC_DYNAMIC_STATE within the list of PAL procedures.
info_type: Unsigned 64-bit value indicating the type of information to return.
dy_buffer...
Page 658
PAL_MC_ERROR_INFO – Get Processor Error Information (25)
Purpose: Returns the Processor Machine Check Information.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_MC_ERROR_INFO within the list of PAL procedures.
info_index: Unsigned 64-bit integer identifying the error information that is being requested....
Page 659
PAL_MC_ERROR_INFO Table 11-86. info_index Values info_index Error Information Type Description Processor Error Map This info_index value will return the processor error map. This return value specifies the processor core identification, the processor thread identification, and a bit-map indicating which structure(s) of the processor generated the machine check.
Page 660
PAL_MC_ERROR_INFO
[Figure 11-19. level_index Layout]
Table 11-87. level_index Fields
Field Bits Description
Processor core ID (default is 0 for processors with a single core)
Logical thread ID (default is 0 for processors that execute a single thread)
11:8 Error information is available for 1st, 2nd, 3rd, and 4th level instruction caches...
Page 661
PAL_MC_ERROR_INFO Table 11-88. err_type_index Values (Continued) err_type_index Return Value Description value mod 8 Responder identifier The responder identifier is a 64-bit integer that specifies the bus agent that responded to a transaction that was responsible for generating the machine check. The structure-specific error information informs the caller if there is a valid responder identifier.
Page 662
PAL_MC_ERROR_INFO instruction pointer available for logging on the second error. If there is, it makes subsequent calls with err_type_index equal to 9, 10, 11, and/or 12 depending on which valid bits are set. The caller continues incrementing the err_type_index value in this fashion until the inc_err_type return value is zero....
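Table 11-88 keys its rows on the err_type_index value mod 8, and the text above uses the values 9 through 12 for the second error. One reading of this, offered here as an assumption rather than as the specification, is that the quotient selects the error and the remainder selects the kind of information. The helper name split_err_type_index is hypothetical.

```python
def split_err_type_index(err_type_index):
    """Split err_type_index into (error_slot, info_kind).

    Assumed interpretation: info_kind is the value mod 8 used to look up
    Table 11-88, and error_slot counts errors from zero (slot 1 being the
    second error).
    """
    if err_type_index < 0:
        raise ValueError("err_type_index is unsigned")
    error_slot, info_kind = divmod(err_type_index, 8)
    return error_slot, info_kind
```

Under this reading, the values 9, 10, 11, and 12 all fall in slot 1 and map to information kinds 1 through 4.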
Page 663
Reserved Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the cache_check parameter is valid.
Page 664
Reserved Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the TLB_check parameter is valid.
Page 665
Reserved Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the bus_check parameter is valid.
Page 666
Reserved Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the reg_file_check parameter is valid.
Page 667
PAL_MC_ERROR_INFO Table 11-93. reg_file_check Fields Field Bits Description 57:56 Privilege level. The privilege level of the instruction bundle responsible for generating the machine check. The pl field of the reg_file_check parameter is valid. Machine check corrected: This bit is set to one to indicate that the machine check has been corrected.
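Table 11-93 places the privilege level in bits 57:56 of reg_file_check. A minimal sketch of extracting that field follows; the helper names are hypothetical, and the caller is assumed to have already confirmed that the pl field is marked valid, since the position of that valid bit is not shown in this excerpt.

```python
def extract_bits(value, high, low):
    """Return the unsigned field value[high:low], both bit positions inclusive."""
    if high < low or low < 0:
        raise ValueError("expected high >= low >= 0")
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def reg_file_check_pl(reg_file_check):
    """Privilege level of the bundle responsible for the machine check (bits 57:56)."""
    return extract_bits(reg_file_check, 57, 56)
```

The same extract_bits helper applies to any of the check parameters once the bit range of a field is known.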
Page 668
Reserved Instruction set. If this bit is set to zero, the instruction that generated the machine check was an Intel Itanium instruction. If this bit is set to one, the instruction that generated the machine check was an IA-32 instruction. The is field in the bus_check parameter is valid.
Page 669
PAL_MC_ERROR_INJECT – Inject Processor Error (276)
Purpose: Injects the requested processor error or returns information on the supported injection capabilities for this particular processor implementation.
Calling Conv: Stacked
Mode: Physical and Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_MC_ERROR_INJECT within the list of PAL procedures.
err_type_info: Unsigned 64-bit integer specifying the first level error information which identifies the error structure and corresponding structure hierarchy, and the error severity....
Page 670
PAL_MC_ERROR_INJECT
Table 11-95. err_type_info
Field Bits Description
mode: Indicates the mode of operation for this procedure:
0 – Query mode
1 – Error inject mode (err_inj should also be specified)
2 – Cancel outstanding trigger. All other fields in err_type_info, err_struct_info and err_data_buffer are ignored....
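The three mode values listed in Table 11-95 map naturally onto an enumeration. This is a sketch only: the names ErrInjectMode and parse_mode are hypothetical, and the bit position of mode within err_type_info is not shown in this excerpt, so the code works on the already-extracted field value.

```python
from enum import IntEnum


class ErrInjectMode(IntEnum):
    """Values of the mode field as listed in Table 11-95."""
    QUERY = 0
    ERROR_INJECT = 1    # err_inj should also be specified
    CANCEL_TRIGGER = 2  # other fields and err_data_buffer are ignored


def parse_mode(mode_value):
    """Map an extracted mode field value to ErrInjectMode.

    Values not listed in the excerpt are rejected rather than guessed at.
    """
    try:
        return ErrInjectMode(mode_value)
    except ValueError:
        raise ValueError(f"mode value {mode_value} is not listed") from None
```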
Page 671
PAL_MC_ERROR_INJECT supported for error injection. The caller is required to use the query mode with appropriate inputs in err_struct_info to determine which combinations of error injection types are supported. If a given combination is not supported, the procedure returns with status -5. The procedure supports both an Error inject and Error inject and consume mode (selectable through the err_inj field in err_type_info).
Page 672
PAL_MC_ERROR_INJECT Table 11-96. resources Return Value Field Bits Description ibr0 When 1, indicates that IBR0,1 are being used by the procedure for trigger functionality. ibr2 When 1, indicates that IBR2,3 are being used by the procedure for trigger functionality. ibr4 When 1, indicates that IBR4,5 are being used by the procedure for trigger functionality.
Page 673
PAL_MC_ERROR_INJECT Table 11-97. err_struct_info – Cache (Continued) Field Bits Description cl_id Indicates which mechanism is used to identify the cache line to be used for error injection: 0 – Reserved 1 – Virtual address provided in the inj_addr field of the buffer pointed to by err_data_buffer should be used to identify the cache line for error injection.
Page 674
PAL_MC_ERROR_INJECT Table 11-98. capabilities vector for cache (Continued) Field Bits Description Error injection in tag portion of cache line is supported data Error injection in data portion of cache line is supported mesi Error injection in mesi portion of cache line is supported Error injection that results in data poisoning events is supported Reserved Reserved...
Page 677
PAL_MC_ERROR_INJECT Table 11-103. err_struct_info – Register File Field Bits Description When 1, indicates that the structure information fields (regfile_id, reg_num) are valid and should be used for error injection. When 0, the structure information fields are ignored, and the values of these fields used for error injection are implementation-specific. regfile_id Identifies the register file where the error should be injected: 0 –...
Page 680
PAL_MC_HW_TRACKING – Query which hardware structures are performing hardware status tracking (51)
Purpose: Provide a way to query which hardware structures are performing hardware status tracking for corrected machine check events.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Dependent...
Page 681
PAL_MC_HW_TRACKING The convention for the levels in the hw_track field is such that the least significant bit in the field represents the lowest level of the structures hierarchy. For example, bit 0 of the ICT field represents the first level instruction cache.
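The level convention can be illustrated by decoding one sub-field of hw_track into a list of hierarchy levels. This is a minimal sketch: the function name and the default field width of 4 bits are assumptions made for illustration, and the sub-field is taken to have been extracted from hw_track already.

```python
def tracked_levels(field_bits, width=4):
    """Return the 1-based hierarchy levels whose bit is set in a sub-field.

    Bit 0 corresponds to level 1 (the lowest level), bit 1 to level 2,
    and so on. The width default is an illustrative assumption.
    """
    return [level for level in range(1, width + 1)
            if (field_bits >> (level - 1)) & 1]
```

For an ICT field value of 0b0101, this reports tracking at the first and third level instruction caches.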
Page 682
PAL_MC_EXPECTED – Set/Reset Expected Machine Check Indicator (23)
Purpose: Informs PALE_CHECK whether a machine check is expected so that PALE_CHECK will not attempt to correct any expected machine checks.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:...
Page 683
PAL_MC_REGISTER_MEM – Register Memory with PAL for Machine Check and Init (27)
Purpose: Registers a platform dependent location with PAL to which it can save minimal processor state in the event of a machine check or initialization event.
Calling Conv: Static Registers Only
Mode: Physical...
Page 684
PAL_MC_RESUME – Restore Minimal Architected State and Return (26)
Purpose: Restores the minimal architectural processor state, sets the CMC interrupt if necessary, and resumes execution.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_MC_RESUME within the list of PAL procedures....
Page 685
PAL_MEM_ATTRIB – Get Memory Attributes (5)
Purpose: Returns the memory attributes implemented by processor.
Calling Conv: Static Registers Only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_MEM_ATTRIB within the list of PAL procedures.
Reserved Reserved Reserved...
Page 686
PAL_MEMORY_BUFFER – Allocate a cacheable memory buffer for exclusive PAL usage (277)
Purpose: Provides cacheable memory to PAL for exclusive use during runtime.
Calling Conv: Stacked
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_MEMORY_BUFFER within the list of PAL procedures.
base_address: Physical address of the memory buffer allocated for PAL use....
Page 687
PAL_MEMORY_BUFFER A memory buffer must be allocated for each physical package, and is shared by all logical processors on that package. Software is required to call this procedure on all logical processors on a given package with the same input values. If not, processor operation is undefined.
Page 688
PAL_PERF_MON_INFO – Get Processor Performance Monitor Information (15)
Purpose: Returns Performance Monitor information about what can be counted and how to configure the monitors to count the desired events.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:...
Page 689
PAL_PERF_MON_INFO Table 11-111. pm_buffer Layout (Continued) Offset Description 0x40 256-bit mask defining which registers can count cycles. 0x60 256-bit mask defining which registers can count retired bundles.
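The two masks at offsets 0x40 and 0x60 can be read out of a copy of pm_buffer as 256-bit integers. The sketch below rests on assumptions not stated in this excerpt: that the buffer is interpreted in little-endian byte order and that bit n of a mask corresponds to register n. The function names are hypothetical.

```python
CYCLES_MASK_OFFSET = 0x40   # registers that can count cycles
RETIRED_MASK_OFFSET = 0x60  # registers that can count retired bundles
MASK_BYTES = 256 // 8


def read_mask(pm_buffer, offset):
    """Read one 256-bit mask from a bytes-like copy of pm_buffer.

    Assumes little-endian byte order.
    """
    chunk = bytes(pm_buffer[offset:offset + MASK_BYTES])
    if len(chunk) != MASK_BYTES:
        raise ValueError("pm_buffer is too short for this mask")
    return int.from_bytes(chunk, "little")


def registers_in_mask(mask):
    """List register numbers whose bit is set (assumes bit n is register n)."""
    return [n for n in range(256) if (mask >> n) & 1]
```

A caller could then intersect the two register lists to find monitors able to count both event types.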
Page 690
PAL_PLATFORM_ADDR – Set Processor Interrupt Block Address and I/O Port Space Address (16)
Purpose: Specifies the physical address of the processor Interrupt Block and I/O Port Space.
Calling Conv: Static Registers Only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:...
Page 691
PAL_PMI_ENTRYPOINT – Setup SAL PMI Entrypoint in Memory (32)
Purpose: Sets the SAL PMI entrypoint in memory.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_PMI_ENTRYPOINT within the list of PAL procedures.
SAL_PMI_entry: 256-byte aligned physical address of SAL PMI entrypoint in memory....
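Several arguments in this chapter carry alignment requirements, such as the 256-byte alignment of SAL_PMI_entry. A caller may wish to check an address before making the call. The helper below is a generic sketch with a hypothetical name, not part of the PAL interface.

```python
def is_aligned(address, alignment):
    """Return True if address is a multiple of alignment.

    alignment must be a positive power of two, as is the case for the
    boundaries named in the argument descriptions.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (address & (alignment - 1)) == 0
```

For example, is_aligned(addr, 256) would be the check for SAL_PMI_entry, and is_aligned(addr, 8) for the 8-byte aligned buffers.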
Page 692
PAL_PREFETCH_VISIBILITY – Make Processor Prefetches Visible (41)
Purpose: Used in the architected sequences for memory attribute transitions described in Section 4.4.11, “Memory Attribute Transition” on page 2:88 to transition a page (or set of pages) from one memory attribute to another.
Calling Conv: Static Registers Only
Mode: Physical and Virtual...
Page 693
PAL_PREFETCH_VISIBILITY This procedure, when used to delete a memory range on-line, will ensure that all of the conditions described in both of the preceding paragraphs regarding transition of virtual memory attributes and physical memory attributes are met. If the processor implementation does not require this procedure call to be made on remote processors in the sequences, this procedure will return a 1 upon successful completion.
Page 694
PAL_PROC_GET_FEATURES – Get Processor Dependent Features (17)
Purpose: Provides information about configurable processor features.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_PROC_GET_FEATURES within the list of PAL procedures.
Reserved
feature_set: Feature set information is being requested for....
Page 695
PAL_PROC_GET_FEATURES the feature may optionally be controllable, and No indicates that the feature cannot be controllable. The control field applies only when the feature is available. The sense of the bits is chosen so that for features which are controllable, the default hand-off value at exit from PALE_RESET should be 0.
Page 696
PAL_PROC_GET_FEATURES Table 11-112. Processor Features (Continued) Class Control Scope Description Opt. Req. Enable the use of the vmsw instruction. When 0, the vmsw instruction causes a Virtualization fault when executed at the most privileged level. When 1, this bit will enable normal operation of the vmsw instruction. This bit has no effect if virtual machine features are disabled (see bit 40).
Page 697
PAL_PROC_GET_FEATURES Table 11-112. Processor Features (Continued) Class Control Scope Description Opt. Opt. Virtual Machine features implemented and enabled. When 1, PSR.vm is implemented and virtual machines features are not disabled. When 0 (features_status) and when the corresponding features_avail bit is 1, virtual machines features are implemented but are disabled.
Page 698
PAL_PROC_SET_FEATURES – Set Processor Dependent Features (18)
Purpose: Enables/disables specific processor features.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_PROC_SET_FEATURES within the list of PAL procedures.
feature_select: 64-bit vector denoting desired state of each feature (1=select, 0=non-select).
feature_set: Feature set to apply changes to....
Page 699
PAL_PSTATE_INFO – Get Information for Power/Performance States (44)
Purpose: Returns information about the P-states supported by the processor.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_PSTATE_INFO within the list of PAL procedures.
pstate_buffer: 64-bit pointer to a 256-byte buffer aligned on an 8-byte boundary....
Page 700
PAL_PSTATE_INFO performance in the P0 state. For example, if the P1-state has a value of 75, and the next P-state (P2) has a value of 50, it implies that P1 performance is 25% lower than P0 performance, and P2 performance is 50% lower than P0 performance. •...
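The arithmetic in the P1/P2 example can be written out directly. This sketch assumes, as the example implies, that the reported value is a percentage of P0 performance; the function name is hypothetical.

```python
def percent_below_p0(performance_value):
    """Return how far below P0 performance a P-state sits, in percent.

    Assumes the value is expressed relative to P0 performance as a
    percentage, following the P1 = 75 and P2 = 50 example.
    """
    return 100 - performance_value
```

So a P-state reporting 75 is 25% lower than P0, and one reporting 50 is 50% lower.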
Page 701
PAL_PTCE_INFO – Get PTCE Purge Loop Information (6)
Purpose: Returns information required for the architected loop used to purge (initialize) the entire TC.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_PTCE_INFO within the list of PAL procedures....
Page 702
PAL_REGISTER_INFO – Return Information about Implemented Processor Registers (39)
Purpose: Returns information about implemented Application and Control Registers.
Calling Conv: Static Registers Only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_REGISTER_INFO within the list of PAL procedures.
info_request: Unsigned 64-bit integer denoting what register information is requested....
Page 703
PAL_RSE_INFO – Get RSE Information (19)
Purpose: Returns information about the register stack and RSE for this processor implementation.
Calling Conv: Static Registers Only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_RSE_INFO within the list of PAL procedures.
Reserved Reserved Reserved...
Page 704
PAL_SET_HW_POLICY – Set Current Hardware Resource Sharing Policy (49)
Purpose: Sets the current hardware resource sharing policy of the processor.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_SET_HW_POLICY within the list of PAL procedures.
policy: Unsigned 64-bit integer specifying the hardware resource sharing policy the caller is setting....
Page 705
PAL_SET_HW_POLICY Table 11-116. Processor Hardware Sharing Policies (Continued) Value Name Description High-priority The processor configures hardware resources to provide the logical processor this procedure was called on a greater share of the competing hardware resources. All competing logical processors will get a smaller share of the competing hardware resources. Exclusive High-priority The processor configures hardware resources such that the logical processor this procedure was called on has a greater share of the competing hardware resources.
Page 706
PAL_SET_PSTATE – Request Processor to Enter Power/Performance State (263)
Purpose: Requests a processor transition to a given P-state.
Calling Conv: Stacked Registers
Mode: Physical and Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_SET_PSTATE within the list of PAL procedures.
p_state: Unsigned integer denoting the processor P-state being requested....
Page 707
PAL_SET_PSTATE coordination. A subsequent call to PAL_SET_PSTATE on any logical processor in the dependency domain (with a force_pstate argument of zero) reinstates hardware coordination. The force_pstate argument is ignored on SCDD and HIDD logical processors. Calling this procedure on some processor implementations may affect P-states of other processors in the same dependency domain.
Page 708
PAL_SHUTDOWN – Shutdown the Processor (45)
Purpose: Put the logical processor into a low power state which can be exited only by a reset event.
Calling Conv: Static Registers Only
Mode: Physical
Buffer: Dependent
Arguments:
index: Index of PAL_SHUTDOWN within the list of PAL procedures.
notify_platform: 8-byte aligned physical address pointer providing details on how to optionally notify the platform that the processor is entering a shutdown state....
Page 709
PAL_TEST_INFO – Information for Processor Self-test (37)
Purpose: Returns the alignment and size requirements needed for the memory buffer passed to the PAL_TEST_PROC procedure as well as information on self-test control words for the processor self-tests.
Calling Conv: Static Registers Only
Mode: Physical...
Page 710
PAL_TEST_PROC – Perform a Processor Self-test (258)
Purpose: Performs the second phase of processor self test.
Calling Conv: Stacked Registers. PAL_TEST_PROC may modify some registers marked unchanged in the Stacked Register calling convention. See additional description below.
Mode: Physical
Buffer: Not dependent
Arguments:...
Page 711
PAL_TEST_PROC
• test_phase defines which phase of the processor self-tests is requested to be run. A value of zero indicates to run phase two of the processor self-tests. Phase two consists of the self-tests that require external memory to execute correctly. A value of one indicates to run phase one of the processor self-tests.
Page 712
PAL_TEST_PROC with the exception of the translation caches, which are evicted as a result of testing. PAL_TEST_PROC is free to invalidate all cache contents. If the caller depends on the contents of the cache, they should be flushed before making this call. PAL_TEST_PROC requires that the RSE is set up properly to handle spills and fills to a valid memory location if the contents of the register stack are needed.
Page 713
PAL_VERSION – Get PAL Version Number Information (20)
Purpose: Returns PAL version information.
Calling Conv: Static registers only
Mode: Physical or Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_VERSION within the list of PAL procedures.
Reserved
Reserved
Reserved
Returns:...
Page 714
PAL_VM_INFO – Get Virtual Memory Information (7)
Purpose: Return information about the virtual memory characteristics of the processor implementation.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_VM_INFO within the list of PAL procedures.
tc_level: Unsigned 64-bit integer specifying the level in the TLB hierarchy for which information is required.
Page 715
PAL_VM_PAGE_SIZE – Get Virtual Memory Page Size Information (34)
Purpose: Returns page size information about the virtual memory characteristics of the processor implementation.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_VM_PAGE_SIZE within the list of PAL procedures.
Page 716
PAL_VM_SUMMARY – Get Virtual Memory Summary Information (8)
Purpose: Returns summary information about the virtual memory characteristics of the processor implementation.
Calling Conv: Static Registers Only
Mode: Physical and Virtual
Buffer: Not dependent
Arguments:
index: Index of PAL_VM_SUMMARY within the list of PAL procedures.
Reserved
Reserved
Reserved...
Page 718
PAL_VM_TR_READ – Read a Translation Register (261)
Purpose: Reads a translation register.
Calling Conv: Stacked Registers
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_VM_TR_READ within the list of PAL procedures.
reg_num: Unsigned 64-bit number denoting which TR to read.
tr_type: Unsigned 64-bit number denoting whether to read an ITR (0) or DTR (1).
Page 719
PAL_VP_CREATE – PAL Create New Virtual Processor (265)
Purpose: Initializes a new vpd for the operation of a new virtual processor in the virtual environment.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_CREATE within the list of PAL procedures
vpd: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
host_iva: 64-bit host virtual pointer to the host IVT for the virtual processor...
Page 720
PAL_VP_CREATE
This procedure returns unimplemented procedure when virtual machine features are disabled. See Section 3.4, “Processor Virtualization” on page 2:44 and “PAL_PROC_GET_FEATURES – Get Processor Dependent Features (17)” on page 2:446 for details.
Page 721
PAL_VP_ENV_INFO – PAL Virtual Environment Information (266)
Purpose: Returns the parameters needed to enter a virtual environment.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_ENV_INFO within the list of PAL procedures
Reserved
Reserved
Reserved
Returns:...
Page 722
PAL_VP_ENV_INFO
Table 11-118. vp_env_info – Virtual Environment Information Parameter
Field Description
Reserved (31:11): Reserved
probe: If 1, processor supports interception of probe instructions. See Section 11.7.4.2.8, “Probe Instruction Virtualization” on page 2:344 for details on the usage of this control. If 0, intercept of probe instructions is not supported.
Page 723
PAL_VP_EXIT_ENV – PAL Exit Virtual Environment (267)
Purpose: Allows a logical processor to exit a virtual environment.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_EXIT_ENV within the list of PAL procedures
Optional 64-bit host virtual pointer to the IVT when this procedure is done
Reserved
Reserved
Returns:...
Page 724
PAL_VP_INFO – PAL Virtual Processor Information (50)
Purpose: Returns information about virtual processor features.
Calling Conv: Static
Mode: Physical
Buffer: Not dependent
Arguments:
index: Index of PAL_VP_INFO within the list of PAL procedures
feature_set: Feature set information is being requested for.
vp_buffer: An address to an 8-byte aligned memory buffer (if used).
Page 725
PAL_VP_INFO
get the vmm_id, although vmm_id is also returned for any other implemented feature sets. For feature_set 0, the vp_buffer argument is ignored.
Page 726
PAL_VP_INIT_ENV – PAL Initialize Virtual Environment (268)
Purpose: Allows a logical processor to enter a virtual environment.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_INIT_ENV within the list of PAL procedures
config_options: 64-bit vector of global configuration settings –...
Page 727
PAL_VP_INIT_ENV
processors in the virtual environment must specify the same value in the config_options parameter during PAL_VP_INIT_ENV, otherwise processor operation is undefined.
Table 11-119. config_options – Global Configuration Options
Field Description
initialize (Global Configuration): If 1, this procedure will initialize the PAL virtual environment buffer for this virtual environment.
Page 728
PAL_VP_INIT_ENV
Table 11-119. config_options – Global Configuration Options (Continued)
Field Description
opcode (Global Virtualization Optimizations): This bit must be set to 1 – opcode information will be provided to the VMM during PAL intercepts within the virtual environment. This opcode may or may not be guaranteed to be the opcode that triggered the intercept.
Page 729
PAL_VP_REGISTER – PAL Register Virtual Processor (269)
Purpose: Register a different host IVT and/or a different optional virtualization intercept handler for the virtual processor specified by vpd.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_REGISTER within the list of PAL procedures
vpd: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
host_iva...
Page 730
PAL_VP_REGISTER
• Relocate the host IVT associated with the virtual processor.
• Specify a different optional virtualization intercept handler for the virtual processor.
This procedure returns unimplemented procedure when virtual machine features are disabled. See Section 3.4, “Processor Virtualization” on page 2:44 and “PAL_PROC_GET_FEATURES –...
Page 731
PAL_VP_RESTORE – PAL Restore Virtual Processor (270)
Purpose: Restores virtual processor state for the specified vpd on the logical processor.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_RESTORE within the list of PAL procedures.
vpd: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD).
Reserved
Reserved...
Page 732
PAL_VP_SAVE – PAL Save Virtual Processor (271)
Purpose: Saves virtual processor state for the specified vpd on the logical processor.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_SAVE within the list of PAL procedures
vpd: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
Reserved
Reserved...
Page 733
PAL_VP_TERMINATE – PAL Terminate Virtual Processor (272)
Purpose: Terminates operation for the specified virtual processor.
Calling Conv: Stacked Registers
Mode: Virtual
Buffer: Dependent
Arguments:
index: Index of PAL_VP_TERMINATE within the list of PAL procedures
vpd: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
Optional 64-bit host virtual pointer to the IVT when this procedure is done
Reserved
Returns:...
Page 734
11.11 PAL Virtualization Services
In order to support efficient handling of interruptions when PSR.vm is 1, a set of PAL virtualization services is defined to allow certain high-frequency PAL functions to be performed in a low-latency and low-overhead manner. Upon successful completion of PAL_VP_INIT_ENV, the virtual base address of the PAL virtualization services (VSA) is returned to the VMM.
Page 735
Table 11-121. State Requirements for PSR for PAL Virtualization Services
PSR Bit Description Value
big-endian memory access enable
user performance monitor enable
alignment check
floating-point registers f2-f31 written
floating-point registers f32-f127 written
interruption state collection enable
interrupt enable
protection key validation enable
data address translation enable
disabled FP register f2 to f31
disabled FP register f32 to f127...
Page 736
c. Specific PAL services can be invoked with PSR.ic equal to 1 or 0. See the description of specific PAL services for details.
d. Most PAL services can be invoked with PSR.bn equal to 1 or 0.
e. Specific PAL services must be invoked with PSR.bn equal to 0. See the description of specific PAL services for details.
Page 737
PAL_VPS_RESUME_NORMAL – Resume Virtual Processor Normal (0x0000)
Purpose: Resumes the current virtual processor. This service is used when vpsr.ic is 1. This service can also be used independent of the state of vpsr.ic if all virtualization accelerations and disables are disabled.
Arguments:
Argument Description...
Page 738
PAL_VPS_RESUME_NORMAL
Table 11-122. Virtual Processor Settings in Architectural Resources for PAL_VPS_RESUME_NORMAL and PAL_VPS_RESUME_HANDLER
Resource Description
External Interrupt Control Registers: The external interrupt control registers contain the state of the virtual processor if d_extint in Virtualization Disable Control (vdc) is 1. Otherwise the external interrupt control registers are virtualized by the VMM and contain VMM state.
Page 739
PAL_VPS_RESUME_NORMAL
Table 11-123. Processor Status Register Settings for Virtual Processor Execution (Continued)
Field Bits Description
33:32: Contains the cpl field of the virtual processor.
VMM-specific.
VMM-specific.
Must be 1.
VMM-specific.
VMM-specific.
VMM-specific.
VMM-specific.
42:41: Contains the ri field of the virtual processor.
Contains the ed bit of the virtual processor.
Page 740
PAL_VPS_RESUME_HANDLER – Resume Virtual Processor Handler (0x0400)
Purpose: Resumes the current virtual processor. This service is used when vpsr.ic is 0.
Arguments:
GR24: VBR0
GR25: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
GR26: Virtualization Acceleration Control (vac) field from the VPD specified in GR25 and CFLE setting at the target instruction.
Page 741
PAL_VPS_SYNC_READ – Synchronize VPD State for Reads (0x0800)
Purpose: Synchronize VPD with the latest implementation-specific virtual architectural state.
Arguments:
GR24: 64-bit host virtual return address
GR25: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
GR26: Reserved
GR27: Reserved...
Page 742
PAL_VPS_SYNC_WRITE – Synchronize VPD State for Writes (0x0c00)
Purpose: Synchronize the implementation-specific virtual architectural state with VPD.
Arguments:
GR24: 64-bit host virtual return address.
GR25: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD).
GR26: Reserved
GR27: Reserved
GR28...
Page 744
PAL_VPS_SET_PENDING_INTERRUPT
PAL_VPS_SET_PENDING_INTERRUPT performs the following actions:
• Copy the virtual highest priority pending interrupt from the VPD into implementation-specific resources.
• Return to VMM by an indirect branch specified in the GR24 parameter.
Page 745
PAL_VPS_THASH – Compute Long Format VHPT Entry Address (0x1400)
Purpose: Compute a long format VHPT entry address.
Arguments:
GR24: 64-bit host virtual return address
GR25: 64-bit virtual address used to compute the hash entry address
GR26: Region register value used to compute the hash entry address
GR27: Virtual PTA
GR28...
Page 746
PAL_VPS_TTAG – Compute Translated Hashed Entry Tag (0x1800)
Purpose: Compute the long format translated hashed entry tag.
Arguments:
GR24: 64-bit host virtual return address
GR25: 64-bit virtual address used to compute the hash entry tag
GR26: Region register value used to compute the hash entry tag
GR27: Reserved
GR28...
Page 747
PAL_VPS_RESTORE – Fast Restore Virtual Processor State (0x1c00)
Purpose: Performs an implementation-specific light-weight restore operation for the specified VPD on the logical processor.
Arguments:
GR24: 64-bit host virtual return address
GR25: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
GR26: Skip implicit synchronization
GR27...
Page 748
PAL_VPS_SAVE – Fast Save Virtual Processor State (0x2000)
Purpose: Performs an implementation-specific light-weight save operation for the specified VPD on the logical processor.
Arguments:
GR24: 64-bit host virtual return address
GR25: 64-bit host virtual pointer to the Virtual Processor Descriptor (VPD)
GR26: Skip implicit synchronization
GR27...
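Each PAL_VPS service heading in this section carries a hexadecimal value in parentheses. The following is a minimal sketch, assuming that value is a byte offset added to the VSA returned by PAL_VP_INIT_ENV to form the branch target; this excerpt does not state that rule outright. The helper name service_entry and the example VSA are hypothetical, and PAL_VPS_SET_PENDING_INTERRUPT is left out because its offset does not appear in this excerpt.

```python
# Offsets copied from the PAL_VPS_* service headings in this section.
PAL_VPS_OFFSETS = {
    "PAL_VPS_RESUME_NORMAL": 0x0000,
    "PAL_VPS_RESUME_HANDLER": 0x0400,
    "PAL_VPS_SYNC_READ": 0x0800,
    "PAL_VPS_SYNC_WRITE": 0x0C00,
    "PAL_VPS_THASH": 0x1400,
    "PAL_VPS_TTAG": 0x1800,
    "PAL_VPS_RESTORE": 0x1C00,
    "PAL_VPS_SAVE": 0x2000,
}


def service_entry(vsa, name):
    """Return the assumed branch target of a service: VSA plus its offset."""
    return vsa + PAL_VPS_OFFSETS[name]


# Hypothetical VSA value, used only to show the arithmetic.
example_vsa = 0xA000000000100000
for name in PAL_VPS_OFFSETS:
    print(f"{name:24s} {service_entry(example_vsa, name):#018x}")
```

Keeping the offsets in one table lets a VMM compute every entry point once, right after PAL_VP_INIT_ENV returns, instead of scattering constants across its intercept paths.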
Page 749
Part II: System Programmer’s Guide
Intel® Itanium Architecture Software Developer’s Manual, Rev. 2.3...
Page 751
About the System Programmer’s Guide
Part II: System Programmer’s Guide is intended as a companion section to the information presented in Part I, “System Architecture Guide”. While Part I provides a crisp and concise architectural definition of the Itanium instruction set, Part II provides insight into programming and usage models of the Itanium system architecture.
Page 752
Chapter 4, “Context Management” describes how operating systems need to preserve Itanium register contents. In addition to spilling and filling a register’s data value, the Itanium architecture also requires software to preserve control and data speculative state associated with that register, i.e. its NaT bit and ALAT state. This chapter also discusses system architecture mechanisms that allow an operating system to significantly reduce the number of registers that need to be spilled/filled on interruptions, system calls, and context switches.
Page 753
This chapter is of interest to platform firmware and operating system developers.
Related Documents
The following documents are referred to fairly often in this document. For more details on software conventions and platform firmware, please consult these manuals (available at http://developer.intel.com).
[SWC] Intel® Itanium®...
Page 755
This chapter closes by describing how to correctly update code images to implement self-modifying code, cross-modifying code, and paging of code using programmed I/O.
An Overview of Intel® Itanium® Memory Access Instructions
The Itanium architecture provides load, store, and semaphore instructions to access memory.
Page 756
• Fence semantics combine acquire and release semantics (i.e. the instruction is made visible after all prior orderable instructions and before all subsequent orderable instructions).
In the above definitions “prior” and “subsequent” refer to the program-specified order. An “orderable instruction” is an instruction that the memory ordering model can use to establish ordering relationships.
Page 757
specific opcode chosen. The xchg instruction always has acquire semantics. These instructions read a value from memory, modify this value using an instruction-specific operation, and then write the modified value back to memory. The read-modify-write sequence is atomic by definition.
2.1.3.1 Considerations for using Semaphores
The memory location on which a semaphore instruction operates must obey two...
Page 758
Memory Ordering in the Intel® Itanium® Architecture
Understanding a system’s memory ordering model is key to writing either user- or...
Page 759
In the Itanium architecture, dependencies between operations by a processor have implications for the ordering of those operations at that processor. The discussion in Section 2.2.1.6 on page 2:515 and Section 2.2.1.7 on page 2:516 explores this issue in greater depth. The following sections examine the Itanium ordering model in detail. Section 2.2.1 presents several memory ordering executions to illustrate important behaviors of the model.
Page 760
“X” and “Y” indicate any orderable instruction.
2.2.1.2 The Intel® Itanium® Architecture Provides a Relaxed Ordering Model
The Itanium memory ordering model is a relaxed model. As a result, the Itanium architecture permits any outcome when executing the code shown in Table 2-1.
Page 761
Processor #0 operations M1 and M2 and the Processor #1 operations M3 and M4 from Table 2-1 execution as shown in Table 2-1.
Table 2-2. Acquire and Release Semantics Order Intel® Itanium® Memory Operations
Processor #0 Processor #1
[x] = 1 // M1 ld.acq...
Page 762
The Itanium ordering semantics always allow a processor to make operations that follow a release visible before the release and to make operations that precede an acquire visible after the acquire.
Table 2-3. Loads May Pass Stores to Different Locations
Processor #0 Processor #1
st.rel...
Page 763
This contradicts the postulated outcome r1 = 0 and r2 = 0 and thus the Itanium memory ordering model disallows the r1 = 0 and r2 = 0 outcome. Specifically, if M3 reads 0, then M4, M5, and M6 may not yet be visible but M1 and M2 must be visible. Thus, when M6 becomes visible it must observe x = 1 because M1 is already visible.
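The argument above can be checked by brute force. The sketch below assumes the execution is Processor #0: store x (M1), mf (M2), load y into r1 (M3), and Processor #1: store y (M4), mf (M5), load x into r2 (M6); that layout is inferred from the prose, since the table itself is not reproduced in this excerpt. It also assumes a deliberately simplified model in which all operations become visible in a single global order. The fences are modeled only as the constraint that each store precedes the following load.

```python
from itertools import permutations

# (processor, kind, location, destination register); assumed layout.
OPS = [
    ("P0", "st", "x", None),  # M1: store 1 to x
    ("P0", "ld", "y", "r1"),  # M3: load y into r1
    ("P1", "st", "y", None),  # M4: store 1 to y
    ("P1", "ld", "x", "r2"),  # M6: load x into r2
]


def outcomes(fenced):
    """Collect every (r1, r2) pair over all visibility orders.

    With fenced=True each processor's store must become visible before
    its later load (the effect of the mf between them). With
    fenced=False the load may pass the store.
    """
    seen = set()
    for order in permutations(range(len(OPS))):
        pos = {op: slot for slot, op in enumerate(order)}
        if fenced and (pos[0] > pos[1] or pos[2] > pos[3]):
            continue
        mem = {"x": 0, "y": 0}
        regs = {}
        for idx in order:
            _, kind, loc, reg = OPS[idx]
            if kind == "st":
                mem[loc] = 1
            else:
                regs[reg] = mem[loc]
        seen.add((regs["r1"], regs["r2"]))
    return seen


print("with fences:   ", sorted(outcomes(fenced=True)))
print("without fences:", sorted(outcomes(fenced=False)))
```

In this model the outcome in which both loads return 0 appears only when the fences are removed, which matches the conclusion of the paragraph above.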
Page 764
2.2.1.7 Data Dependency Establishes Local Ordering
In the Itanium architecture, a dependency (e.g., a later operation reading the value written by an earlier operation) can imply a local ordering relationship between the two operations. This section focuses on dependencies through registers only. Section 2.2.1.6 discusses dependencies and MP ordering.
Page 765
The Itanium architecture does not allow the outcome r1 = x and r2 = 0 in this execution either. Unlike the execution in Table 2-6, there is no direct dependency between the values that M3 produces and the values that M4 consumes. However, there is a RAW through register r1 from M3 to C1 and a RAW through register p1 from C1 to M4.
Page 766
2.2.1.8 Store Buffers May Satisfy Local Loads
In the Itanium memory ordering model, store buffers (or other logically-equivalent structures) may satisfy local read requests from loads or acquire loads even if the stored data is not yet visible to other agents in the coherence domain. Such bypassing must honor any ordering semantics in the memory reference stream.
Page 767
to account for both the memory ordering semantics and dependencies. It is important to keep in mind that the observance of a dependency between two operations does not imply an ordering relationship (from the standpoint of the memory ordering model) between the operations as Section 2.2.1.6 describes.
Page 768
Like Section 2.2.1.8, the discussion in this section focuses on the outcome r1 = 1, r3 = 1, r2 = 0, and r4 = 0 because it is allowed if and only if store buffers can satisfy local loads. The line of reasoning to show that the outcome r1 = 1, r3 = 1, r2 = 0, and r4 = 0 is not allowed in Table 2-11 is similar to the reasoning used to show that this outcome...
Page 770
A store buffer may not provide a local read operation early access to a value written by a semaphore operation. Therefore, the outcome r1 = 1, r3 = 1, r2 = 0, r4 = 0, r5 = 0, and r6 = 0 in the Table 2-13 execution is not allowed.
Page 771
The fact that the store to x is a release store implies that, since there is a causal relationship from M1 to M3, M1 must become visible to processor #2 before M3.
Table 2-15. Intel® Itanium® Architecture Obeys Causality
Processor #0 Processor #1 Processor #2
st.rel [x] = 1 // M1...
Page 772
2.2.2 Memory Attributes
In addition to the ordering semantics and data dependencies, the memory attributes of the page that is being referenced also influence access ordering and visibility. Using memory attributes allows the Itanium architecture to match the performance and the usage model to the type of device (e.g.
Page 773
2.2.3 Understanding Other Ordering Models: Sequential Consistency and IA-32
To provide a point of reference, it is helpful to understand other memory ordering models. These ordering models affect not only the programmer’s view of the system, but also the overall system performance and design. Processors with relaxed memory ordering models may achieve higher performance than those with strict ordering models.
Page 774
For example, consider the code shown in Figure 2-3.
Figure 2-3. Why a Fence During Context Switches is Required in the Intel® Itanium® Architecture
// Process A begins executing on Processor #0...
ld.acq...
Page 775
2.4.1 Spin Lock
Software commonly uses spin locks to guard access to a critical region of code. In these locks, the software “spins” while waiting for a shared lock variable to indicate that the critical region can be safely accessed. Typically, the lock code uses atomic operations such as compare and exchange or fetch and add to update the shared lock variable.
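The spin-lock pattern described above can be sketched in a few lines. This is an illustration in Python rather than Itanium assembly: the Word class and its cmpxchg method are hypothetical stand-ins for a memory word and the compare and exchange instruction, with a threading.Lock emulating hardware atomicity. The acquire and release ordering that the real instructions would carry is not modeled.

```python
import threading
import time


class Word:
    """One shared memory word with an emulated atomic compare-and-exchange."""

    def __init__(self, value=0):
        self._value = value
        self._atomic = threading.Lock()  # stands in for hardware atomicity

    def cmpxchg(self, expected, new):
        """Store `new` if the word equals `expected`; return the old value."""
        with self._atomic:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def store(self, value):
        with self._atomic:
            self._value = value


def acquire(lock):
    # Spin until this thread is the one that swaps 0 (free) to 1 (held).
    while lock.cmpxchg(0, 1) != 0:
        time.sleep(0)  # yield so the holder can make progress


def release(lock):
    lock.store(0)  # mark the lock free again


lock = Word()
shared = {"counter": 0}


def worker(iterations):
    for _ in range(iterations):
        acquire(lock)
        shared["counter"] += 1  # critical region
        release(lock)


threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["counter"])  # 4 threads x 500 increments
```

Only the thread whose compare and exchange observes 0 enters the critical region, so the unprotected increment inside it cannot interleave with another thread's increment.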
Page 776
2.4.2 Simple Barrier Synchronization
A barrier is a common synchronization primitive used to hold a set of processes at a particular point in the program (the barrier) until all processes reach the location. Once all processes arrive at the barrier, they may all continue to execute. Figure 2-5 shows a sense-reversing barrier synchronization based on the fetchadd instruction from Hennessy and Patterson [HP96].
Page 777
indicates the value that release must have before the processor can leave the barrier. The last processor to arrive at the barrier releases the other processors by setting release to the new local_sense value. The mf instruction in Figure 2-5 is necessary only if the programmer wishes to ensure that memory operations performed before the barrier code are visible to memory operations performed by any processor after the barrier code.
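The sense-reversing scheme described here can be sketched as follows. This is a Python illustration, not the Figure 2-5 code: the SenseBarrier class is hypothetical, a threading.Lock emulates the atomic fetchadd, and the optional mf is not modeled. The names count, release, and local_sense follow the text.

```python
import threading
import time
from types import SimpleNamespace


class SenseBarrier:
    def __init__(self, total):
        self.total = total
        self.count = 0    # number of arrivals in the current episode
        self.release = 0  # value the waiters are watching
        self._atomic = threading.Lock()  # emulates the atomic fetchadd

    def _fetchadd(self, inc):
        """Add inc to count and return the value count held before."""
        with self._atomic:
            old = self.count
            self.count = old + inc
            return old

    def wait(self, state):
        state.local_sense = 1 - state.local_sense  # reverse the sense
        if self._fetchadd(1) == self.total - 1:    # last to arrive
            self.count = 0                         # reset for the next use
            self.release = state.local_sense       # release the others
        else:
            while self.release != state.local_sense:
                time.sleep(0)                      # spin politely


def worker(barrier, log, errors, phases, total):
    state = SimpleNamespace(local_sense=0)  # per-thread sense
    for phase in range(phases):
        log.append(phase)                   # work done before the barrier
        barrier.wait(state)
        if log.count(phase) != total:       # everyone must have arrived
            errors.append(phase)


TOTAL, PHASES = 4, 3
barrier = SenseBarrier(TOTAL)
log, errors = [], []
threads = [
    threading.Thread(target=worker, args=(barrier, log, errors, PHASES, TOTAL))
    for _ in range(TOTAL)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("violations:", errors)
```

Because the last arrival resets count before it writes release, a fast thread that re-enters the barrier for the next phase always starts from a zero count, which is what makes the barrier reusable.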
Page 778
Figure 2-6. Dekker’s Algorithm in a 2-way System
// The flag_me variable is zero if we are not in the
// synchronization and critical section code and non-zero
// otherwise; flag_you is similarly set for the other processor.
// This algorithm does not retry access to the
// resource if there is contention.
Page 779
Figure 2-7. Lamport’s Algorithm
// The proc_id variable holds a unique, non-zero id for the process that
// attempts access to the critical section. x and y are the synchronization
// variables that indicate who is in the critical section and who is
// attempting entry.
Page 780
• Programmed I/O for paging of code pages.
• DMA for paging of code pages.
The next four sections discuss these techniques in greater depth. To illustrate the code sequences for self- and cross-modifying code, the examples in this section use the syntax “st [foo] = new” to represent a group of aligned stores that change the instruction at address foo to the instruction “new”.
Page 781
2.5.2 Cross-modifying Code
Consider a multi-threaded program for a multiprocessor system that dynamically updates some procedure that any processor in the system may execute. The program maintains several disjoint buffers to hold the new code and requires a processor to execute an IP-relative branch instruction at some address x to reach the code.
Page 782
The release store ensures that the code image updates are made visible to the remote processors in the proper order (i.e. new_code is updated before the branch at address x is updated). Using the final three instructions ensures that the remote processors will see the new code the next time they execute the branch at address x.
Page 783
Figure 2-10. Updating a Code Image on a Remote Processor
patch_l_and_r:
    st [code] = new_inst // write new instruction
    fc.i code ;; // flush new instruction
    sync.i ;; // sync i stream with store
// If the local processor must ensure that remote processors see
// the preceding memory updates before any subsequent memory
// operations, the following code is also necessary.
Page 784
Finally, software may also eliminate the mf or srlz.i instructions if it guarantees that these operations will take place elsewhere (e.g. in the operating system) before the processor attempts to execute the updated code. For example, context switch routines must contain a memory fence (see Section 2.3 on page...
Page 785
Interruptions and Serialization
This chapter discusses the interruption and serialization model. Although the Itanium architecture is an explicitly parallel architecture, faults and traps are delivered in program order based on IP, and from left-to-right in each instruction group. In other words, faults and traps are reported precisely on the instruction that caused them.
Page 786
• When an external or independent agent (I/O device, timer, another processor) requires attention from the processor, an interrupt occurs. There are several types of interrupts. An initialization interrupt occurs when the processor has received an initialization request. A Platform Management Interrupt (PMI) can be generated by the platform to request features such as power management.
Page 787
instruction address translation is disabled, the IVA register should contain the physical address of the base of the IVT. Software must further ensure that instruction and memory references from low-level interruption handlers do not generate additional interruptions until enough state has been saved and interruption collection can be re-enabled.
Page 788
Debug breakpoints, lower-privilege interception, taken branch and single step trapping are disabled.
Current privilege level becomes most privileged.
Intel Itanium Instruction set. Handlers execute Intel Itanium instructions.
id, da, ia, dd, ed: Instruction/data debug, access bit and speculation deferral bits are disabled.
Page 789
A processor based on the Itanium architecture provides the following interruption registers for collecting information about the latest interruption or the state of the machine at the time of the interruption: • IPSR – A copy of the processor status register (PSR) at the moment the interruption occurred.
Page 790
“Interruption Vector Descriptions” for details. Software can use the instruction bundle information for debug and emulation purposes. No other architectural state is modified when an interruption occurs. Note that only IIP, IPSR, ISR, and IFS are written by all interruptions (assuming PSR.ic is 1 at the time of interruption);...
Page 791
For example, assume that GR2 contains the new value for IVA and that PSR.i is 1. To modify the IVA register, software would perform the following code sequence, where the code page is mapped by an instruction translation register or instruction translation is disabled:
rsm psr.i // external interrupts disabled upon next instruction...
Page 792
A typical lightweight interruption handler can operate completely out of register bank 0. If the bank 0 registers provide sufficient storage for the handler, none of the interrupted context’s register state need be saved to memory, and the handler does not need to use stacked registers.
Page 793
4. Allocate a “trap frame” to store the interrupted context’s state on the kernel memory stack, and move the interruption state (IIP, IPSR, IIPA, ISR, IFA, IFS, IIB0-1), the interrupted memory stack pointer and the interrupted predicate registers from the banked registers to the trap frame.
5.
Page 794
ssm 0x4000 ;; // Set PSR.i
There is no need to explicitly serialize the PSR.i update, unless there is a requirement to force sampling of external interrupts right away. Without the serialization, the PSR.i update will occur at the very latest when the next exception causes an implicit instruction serialization to occur.
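The immediate 0x4000 in the ssm instruction above is a bit mask. As a minimal sketch, the helpers below model ssm and rsm as plain set and clear operations on an integer PSR image; the function names mirror the mnemonics but are otherwise hypothetical, and the serialization behavior discussed above is not modeled. The bit position 14 follows directly from 0x4000.

```python
PSR_I = 1 << 14  # equals 0x4000, the immediate used with ssm above


def ssm(psr, mask):
    """Set the PSR bits selected by mask (models 'ssm imm')."""
    return psr | mask


def rsm(psr, mask):
    """Clear the PSR bits selected by mask (models 'rsm imm')."""
    return psr & ~mask


psr = 0
psr = ssm(psr, 0x4000)  # enable external interrupts
print(hex(psr), bool(psr & PSR_I))
psr = rsm(psr, PSR_I)   # disable them again
print(hex(psr), bool(psr & PSR_I))
```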
Page 795
heavyweight interruption handler), we say that a nested interruption has occurred. On a nested interruption (other than a Data Nested TLB fault) only ISR is updated by the hardware. All other interruption registers preserve their pre-interruption contents. With the exception of the Data Nested TLB fault, the Itanium architecture does not support nested interruptions.
Page 797
4-1, software is required to use different state preservation methods depending on the type of register. More details on register preservation are provided in the next two sections.
Table 4-1. Preserving Intel® Itanium® General and Floating-point Registers
Floating-point State Components...
Page 798
4.1.1 Preserving General Registers
The Itanium general register file is partitioned into two register sets: GR0-31 are termed the static general registers and GR32-127 are termed the stacked general registers. Typically, st8.spill and ld8.fill instructions are used to preserve the static GRs, and the processor’s register stack engine (RSE) automatically preserves the stacked GRs.
Page 799
4.1.2 Preserving Floating-point Registers
The Itanium architecture encodes a floating-point register’s control speculative state as a special unnormalized floating-point number called NaTVal. As a result, Itanium floating-point registers do not have a NaT bit. The architecture provides the stf.spill and ldf.fill instructions to save and restore floating-point register values and control speculative state.
Page 800
In principle, preserved GRs and FRs need not be spilled/filled when entering the kernel. Whatever function is called from the low-level interruption handler or the system call entry point will itself observe the calling conventions and preserve the registers. The only occasion when preserved registers need to be spilled/filled is on a process or thread context switch.
Page 801
Automatic preservation offers performance benefits: the register stack may contain only a handful of dirty registers, system call parameters can be passed on the register stack, and, upon return to the interrupted context the loadrs instruction only needs to restore registers that were actually spilled to memory. Since system call rates scale with processor performance, the RSE offers a key method for reducing the kernel’s execution time of a system call.
Page 802
two “disabled” bits, PSR.dfl and PSR.dfh, are accessible to the privileged software alone. Setting a “disabled” bit causes a fault into the disabled-fp vector upon first use (read or write) of the corresponding register set. As mentioned earlier, an involuntary kernel entry (e.g. interruption) needs to preserve all scratch floating-point registers.
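A small sketch of how the two “disabled” bits gate register use is given below. It assumes the partition named in the PSR descriptions elsewhere in this manual (PSR.dfl covers f2 to f31, PSR.dfh covers f32 to f127) and that f0 and f1, which hold fixed values, are covered by neither bit. The function name is hypothetical.

```python
def disabled_fp_fault(fr, dfl, dfh):
    """Return True if using floating-point register `fr` would fault.

    dfl and dfh are the current values (0 or 1) of PSR.dfl and PSR.dfh.
    """
    if not 0 <= fr <= 127:
        raise ValueError("floating-point registers are f0 through f127")
    if fr < 2:
        return False      # f0 and f1: assumed not covered by either bit
    if fr < 32:
        return bool(dfl)  # low set, f2-f31
    return bool(dfh)      # high set, f32-f127


# With only the high set disabled, the low set stays usable.
print(disabled_fp_fault(8, dfl=0, dfh=1))   # low register
print(disabled_fp_fault(64, dfl=0, dfh=1))  # high register
```

An operating system can exploit this split by leaving the high set disabled after a context switch and saving or restoring it only when the first fault shows that the new thread actually uses those registers.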
Page 803
never accessible to software during the system call (see Section 4.2.2 for details). This works, because at the system call entry user-code may not have any dependencies on the state of the scratch registers. System Calls Reducing the overhead associated with system calls becomes more important as processor efficiency increases.
Page 804
the epc until the switch to the kernel backing store has been completed. Additionally, low-level operating system handlers should not only use IPSR.cpl, but should also check BSPSTORE, to determine whether they are running on the kernel backing store (imagine an external interrupt being delivered on the first instruction after the epc). 4.4.2 break/rfi The break instruction, when issued in the i, f, and m syllables, specifies an arbitrary...
Page 805
Context Switching This section discusses context switching at the user and kernel levels. 4.5.1 User-level Context Switching 4.5.1.1 Non-local Control Transfers (setjmp/longjmp) A non-local control transfer such as the C language setjmp()/longjmp() pair requires software to correctly handle the register stack and the RSE. The register stack provides the BSP application register which always contains the backing store address of the current GR32.
Page 806
Write RSC with setjmp_rsc. d. Write PFS with setjmp_bsp. 6. Restore setjmp()’s return IP into BR7. 7. Return from longjmp() into setjmp()’s caller using br.ret instruction. 4.5.1.2 User-level Co-routines The following steps need to be taken to execute a voluntary user-level thread switch. 1.
Page 807
5. Restore the default control register (DCR) of the inbound context (if the DCR is maintained on a per-process basis). 6. Restore the contents of the protection key registers associated with the inbound context.
Page 809
Memory Management This chapter introduces various memory management mechanisms of the Itanium architecture: region register model, protection keys, and the virtual hash page table usage models are described. This chapter also discusses usage of the architecture translation registers and translation caches. Outlines are provided for common TLB and VHPT miss handlers.
Page 810
region register; they are not inserted into the TLB. Likewise, when software purges a translation from the processor's TLBs, the VRN bits of the address used for the purge are used only to index the corresponding region register and are not used to find a matching translation.
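The role of the VRN bits can be illustrated with a small model. This is a minimal sketch, assuming the VRN is the top three bits (63:61) of a 64-bit virtual address and that there are eight region registers; the function names are hypothetical, and the model uses the whole 61-bit offset where real hardware would match on the virtual page number.

```python
VRN_SHIFT = 61                       # assumption: VRN occupies bits 63:61
OFFSET_MASK = (1 << VRN_SHIFT) - 1   # low 61 bits: offset within the region


def split_virtual_address(va):
    """Return (vrn, offset) for a 64-bit virtual address."""
    va &= (1 << 64) - 1
    return va >> VRN_SHIFT, va & OFFSET_MASK


def lookup_key(va, region_registers):
    """Model of what identifies a translation: (RID, offset).

    The VRN only selects a region register; it is not part of the key.
    """
    vrn, offset = split_virtual_address(va)
    return region_registers[vrn], offset
```

In this model, two addresses in different regions produce the same key whenever their region registers hold the same RID, which is why a purge address only needs its VRN to pick the region register.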
Page 811
In a MAS OS, the RID bits act as an address space identifier or tag. For each process-private region, a unique RID is assigned to that process by the OS. If a process needs multiple process-private regions (e.g. the process requires a private 64-bit address space), the OS assigns multiple unique RIDs for each such region.
Page 812
5.1.2 Protection Keys The Itanium architecture provides two mechanisms for applying protection to pages. The first mechanism is the access rights bits associated with each translation. These bits provide privilege level-granular access to a page. The second mechanism is the protection keys.
Page 813
running, the OS will insert a valid PKR with the protection key 0xA and the ‘rd’ bit cleared, to allow this process to read from the page. However, the ‘wd’ bit for this PKR will be set when the consumer process is running to prevent it from writing the page. The processor hardware has no notion of which protection keys belong to which process.
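The consumer-process case described above can be modelled in a few lines. This is a minimal sketch, not the hardware algorithm: the `PKR` class and `key_permits` function are hypothetical names, and the `xd` (execute disable) field is an assumption added for symmetry with `rd` and `wd`.

```python
from dataclasses import dataclass


@dataclass
class PKR:
    valid: bool
    key: int
    rd: bool = False   # read disable
    wd: bool = False   # write disable
    xd: bool = False   # execute disable (assumed field)


def key_permits(pkrs, page_key, access):
    """Return True if `access` ("r", "w" or "x") passes the key check.

    Returns False both when a matching PKR disables the access and when
    no valid PKR matches (where real hardware would raise a key miss).
    """
    for pkr in pkrs:
        if pkr.valid and pkr.key == page_key:
            disabled = {"r": pkr.rd, "w": pkr.wd, "x": pkr.xd}[access]
            return not disabled
    return False
```

With the consumer's register set modelled as `[PKR(True, 0xA, rd=False, wd=True)]`, a read of a page tagged with key 0xA passes and a write does not.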
Page 814
The TCs are treated as a set associative cache and are not addressable by software. The TC replacement policy is determined by the processor hardware. All processor models implement at least 8 instruction and 8 data TRs, and at least 1 instruction and 1 data TC entry. Software inserts translations into the TLBs via insertion instructions.
Page 815
6. Using the general registers from steps 4 and 5, execute the itr.i or itr.d instruction. A data or instruction serialization operation must be performed after the insert (for itr.d or itr.i, respectively) before the inserted translation can be referenced. Software may insert a new translation into a TR slot already occupied by another valid translation.
Page 816
The size, associativity, and replacement policy of the TC array are implementation-dependent. With the exception of the forward progress rules defined in Section 4.1.1.2, “Translation Cache (TC)” on page 2:49, software cannot depend on the existence or life-span of a TC translation, as a TC entry may be replaced or invalidated by the hardware at any time.
Page 817
A data or instruction serialization operation must be performed after the ptc.l before the translation is guaranteed to be no longer visible to the local data or instruction stream, respectively. The ptc.l instruction does not modify the page tables nor any other memory location, nor does it affect the TLB state of any processor other than the one on which it is executed.
Page 818
5.2.2.2.3 ptc.g, ptc.ga The Itanium architecture supports efficient global TLB shootdowns via the ptc.g and ptc.ga instructions. These instructions obviate the need for performing inter-processor interrupts to maintain TLB coherence in a multiprocessor system. A TLB coherence domain is defined as a group of processors in a multiprocessor system which maintain TLB coherence via hardware.
Page 819
The ptc.ga variant of the global purge instruction behaves just like the ptc.g variant, but it also removes any ALAT entries which fall into the address range specified by the global shootdown from all remote processors’ ALATs. The ptc.ga variant is intended to be used whenever a translation is remapped to a different physical address to ensure that any stale ALAT entries are invalidated.
Page 820
tables, or as a primary page table with collision chains. The long format VHPT is a much better representation for address spaces that are sparsely populated, since the short format VHPT has a linear layout and would consume a large amount of memory.
Page 821
5.3.2 Long Format The long format VHPT is organized as a hash table which contains a subset of all translation entries. The long format VHPT entries contain an 8-byte field that is ignored by the VHPT walker and can be used by the operating system to link VHPT entries to software-walkable hash collision chains if it uses the VHPT as its primary page table.
Page 822
Since the VHPT walker may abort a walk at any time and raise these faults, software must always be able to handle all TLB faults, even when the VHPT walker is enabled. Upon entry to these fault handlers, the IHA, ITIR, and IFA control registers are initialized by the hardware as follows: •...
Page 823
5.4.2 VHPT Translation Vector Processors based on the Itanium architecture do not perform recursive TLB hardware page walks. Since the VHPT is itself a virtually addressed structure, each reference performed by the walker itself goes through the TLBs and may miss. These faults are raised when the VHPT walker is enabled, but the walker misses the TLBs when attempting to service a TLB miss caused by the program.
Page 824
For a long format VHPT, additional steps are required to load bytes 16-23 of the VHPT entry and check for the correct tag; see Section 5.4.1 for more details. A separate structure other than the VHPT may be used to back VHPT translations, in which case the handler would not use the thash instruction to generate the address of the translation mapping the VHPT entry corresponding to the original faulting address.
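The tag check on bytes 16-23 can be sketched as follows. This is an illustrative model only: it assumes a 32-byte little-endian entry in which the first two 8-byte words hold the translation, bytes 16-23 hold the tag, and the last 8 bytes are the OS-owned field used as a collision-chain link. `END_OF_CHAIN`, the byte-offset link encoding, and all function names are hypothetical.

```python
import struct

ENTRY_FORMAT = "<4Q"                         # four little-endian 64-bit words
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)   # 32 bytes per long format entry
END_OF_CHAIN = 0xFFFFFFFFFFFFFFFF            # hypothetical OS-chosen terminator


def pack_entry(word0, word1, tag, os_link=END_OF_CHAIN):
    """Build one entry; os_link is a byte offset to the next chained entry."""
    return struct.pack(ENTRY_FORMAT, word0, word1, tag, os_link)


def unpack_entry(raw):
    word0, word1, tag, os_link = struct.unpack(ENTRY_FORMAT, raw)
    return {"word0": word0, "word1": word1, "tag": tag, "os_link": os_link}


def find_translation(table, head_offset, wanted_tag):
    """Walk the software collision chain; return the matching entry or None."""
    offset = head_offset
    visited = set()                          # guards against a corrupted, cyclic chain
    while offset != END_OF_CHAIN and offset not in visited:
        visited.add(offset)
        entry = unpack_entry(table[offset:offset + ENTRY_SIZE])
        if entry["tag"] == wanted_tag:       # the check on bytes 16-23
            return entry
        offset = entry["os_link"]
    return None
```

A handler built along these lines would fall back to the operating system's own page tables when `find_translation` returns `None`.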
Page 825
The processor will not deliver a Data Nested TLB fault when PSR.ic is in-flight; Data Nested TLB faults are only delivered when PSR.ic is 0. If PSR.ic is in-flight, any data references which miss the TLB and trigger a fault will raise a Data TLB fault, and the processor will set ISR.ni to 1.
Page 826
Figure 5-2. Subpaging (diagram labels: Native Page Table with 16K PTEs, Sub-table with 4K PTEs)
When one of the subdivided pages is referenced and does not have a translation in the TLB, a TLB miss will occur.
Page 827
Runtime Support for Control and Data Speculation An Itanium architecture-based operating system needs to handle exceptions generated by control speculative loads (ld.s or ld.sa), data speculative loads (ld.a) and architectural loads (ld) in different ways. Software does not have to worry about control or data speculative loads potentially hitting uncacheable memory with side-effects, since ld.s, ld.sa, and ld.a instructions to non-speculative memory are always deferred by the processor. For details, refer to Section 4.4.6, “Speculation Attributes”...
Page 828
Details on these three models are discussed in the next three sections as well as in Section 5.5.5, “Deferral of Speculative Load Faults” on page 2:105. 6.1.1 Hardware-only Deferral Hardware-only deferral is configured by setting all speculation deferral bits in the DCR register (dd, da, dr, dx, dk, dp and dm) to 1.
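As an illustration of the hardware-only configuration, the sketch below builds the DCR value with all seven deferral bits set. The bit positions (dm=8 through dd=14) are an assumption recalled from the DCR layout and should be checked against the register definition; the function names are hypothetical.

```python
# Assumed bit positions of the speculation deferral bits within DCR.
DCR_DEFERRAL_BITS = {"dm": 8, "dp": 9, "dk": 10, "dx": 11, "dr": 12, "da": 13, "dd": 14}


def hardware_only_deferral(dcr):
    """Return `dcr` with every speculation deferral bit set to 1."""
    for bit in DCR_DEFERRAL_BITS.values():
        dcr |= 1 << bit
    return dcr


def is_hardware_only(dcr):
    """True if all deferral bits are set, i.e. no fault is left to software."""
    return all((dcr >> bit) & 1 for bit in DCR_DEFERRAL_BITS.values())
```

Under these assumed positions the seven bits form the mask 0x7F00, and the other DCR fields are left untouched.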
Page 829
• ITLB.ed=0 (no control speculative recovery code): The compiler generates recovery code only for ld.sa and ld.a instructions that have speculatively executed uses. Speculation failure of ld.sa and ld.a instructions that have no speculatively executed uses can be recovered by a ld.c instruction, and hence do not require recovery code.
Page 830
The following pseudo code outlines the basic steps for an unaligned reference handler: 1. Ensure that only ISR.r is 1, and that ISR.w, ISR.x, and ISR.na are 0. 2. Inspect the ISR.sp and ISR.ed. If both are 1, then defer this control speculative load by setting IPSR.ed and rfi-ing.
Page 831
Instruction Emulation and Other Fault Handlers This chapter introduces several common emulation handlers that an Itanium architecture-based operating system must support. A general overview is given for: • Unaligned Reference Handler – emulation of misaligned memory references that the processor hardware cannot handle, or has been configured to fault on. •...
Page 832
Unsupported Data Reference Handler Processors based on the Itanium architecture do not support all types of memory references to all memory attributes. In particular: • Semaphore operations to uncacheable memory are not supported. For details consult Section 2.1.3.2, “Behavior of Uncacheable and Misaligned Semaphores” on page 2:509.
Page 833
(movl), they encode their immediate in the L and the X slot of the bundle. The Intel Itanium processor does not support the long branch instruction, brl, and requires the operating system to emulate its behavior. When an Itanium processor encounters a brl instruction, it vectors to the Illegal Operation Fault handler, regardless of the branches’...
Page 834
specified in the brl.call instruction with the IP of the successor of the brl.call (predication helps here as the Itanium instruction set does not provide an indirect move to branch register instruction). • The handler forms the 60-bit immediate IP-offset for the brl target from the i and imm20 fields from the X syllable of the bundle (the brl instruction) and the imm39 field from the L syllable of the bundle.
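The target computation in the last step can be sketched as below. This is a minimal model, assuming the 60-bit immediate is assembled as `i` (most significant bit), then `imm39`, then the 20-bit field, and that the result is a bundle offset (shifted left by 4) added to the IP of the bundle containing the brl. The function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def brl_target(bundle_ip, i, imm39, imm20):
    """Compute the brl target from the fields of the L and X syllables."""
    if i not in (0, 1) or not 0 <= imm39 < (1 << 39) or not 0 <= imm20 < (1 << 20):
        raise ValueError("field out of range")
    imm60 = (i << 59) | (imm39 << 20) | imm20
    # Shifting by 4 places bit `i` at bit 63, so 64-bit wraparound
    # handles backward branches without explicit sign extension.
    return (bundle_ip + (imm60 << 4)) & MASK64
```

For example, with all immediate bits set the offset is -16 bytes, so the target is the bundle immediately preceding the brl.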
Page 835
754-1985 for Binary Floating-point Arithmetic (IEEE-754). It is useful to operating system writers who create and maintain floating-point exception handling software. Floating-point Exceptions in the Intel® Itanium® Architecture Floating-point exception handling in the Itanium architecture has two major responsibilities.
Page 836
SWA Faults, is limited to the scalar reciprocal and scalar reciprocal square-root approximation instructions and is not implementation dependent. It is required for the correctness of the divide and square root algorithms. 8.1.1.1 SWA Faults The Itanium architecture allows an implementation to raise SWA faults as required. Therefore an implementation-independent operating system must be able to emulate the architectural behavior of all FP instructions that can raise a floating-point exception.
Page 837
Inexact. This is a trivial case for the SWA Trap handler, since the result of the second IEEE rounding is identical to the first IEEE rounding. Figure 8-1. Overview of Floating-point Exception Handling in the Intel® Itanium® Architecture...
Page 838
input/output register specifiers. 3. From the ISR.code and FPSR trap enable controls, determine if a SWA Trap has occurred, if not go to the last step. 4. Read the first IEEE rounded result from the FR output register. 5. From the opcode and the status field, decode the result range and precision. 6.
Page 839
At the application level, a user floating-point exception handler could handle the Itanium floating-point exception directly. This is the traditional operating system approach of providing a signal handler with a pointer to a machine-dependent data structure. It would be more convenient for the application developer if the operating system were to first transform the results to make them IEEE-754 conforming and then present the exception to the user in an abstracted manner.
Page 840
8.1.2.3 Denormal/Unnormal Operand Exception (Fault) The exception-enabled response of the Itanium arithmetic instruction to a Denormal/Unnormal Operand exception is to leave the operands unchanged and to set the D bit in the ISR.code field of the ISR register. The operating system kernel, reached via the floating-point fault vector, will then invoke the user floating-point exception handler, if one has been registered.
Page 841
Just as for overflow, the actual scaling of the result is not performed by the Itanium architecture. It is typically performed by the IEEE Filter, which is invoked before calling the user floating-point exception handler. 8.1.2.6 Inexact Exception (Trap) The exception-enabled response of an Itanium arithmetic instruction to an Inexact exception is to set the I bit (and possibly the FPA bit) in the ISR.code field of the ISR register and the Inexact flag in the appropriate status field of the FPSR register.
Page 843
IA-32 Application Support The Itanium architecture enables Itanium architecture-based operating systems to host IA-32 applications, Itanium architecture-based applications, as well as mixed IA-32/Itanium architecture-based applications. Unless the operating system explicitly intercepts ISA transfers (using the PSR.di), user-level code can transition between the two instruction sets without operating system intervention.
Page 844
As mentioned earlier, user-level code can transition from Itanium to IA-32 (or back) instruction sets without operating system intervention. As described in Chapter 6, “IA-32 Application Execution Model in an Intel® Itanium® System Environment” in Volume 1, two instructions are provided for this purpose: br.ia (an Itanium unconditional branch), and JMPE (an IA-32 register indirect and absolute jump).
Page 845
IA-32 return address (address of the IA-32 instruction following the JMPE itself) in IA_64 register GR1. 9.1.4 Procedure Calls between Intel® Itanium® and IA-32 Instruction Sets If procedure call linkage is required between Itanium architecture-based and IA-32 subroutines, software needs to perform additional work as described in the next two sections.
Page 846
4. Make sure JMPE knows where to return to, e.g. deposit return address for the JMPE on memory stack or pass it in an IA-32 visible register. 5. Setup IA-32 branch target in branch register. 6. Flush register stack, but no other RSE updates. 7.
Page 847
11. Ensure memory stack pointer is correctly aligned prior to returning to IA-32 code. 12. br.ia returns to IA-32 caller. IA-32 Architecture Handlers An Itanium architecture-based operating system needs to be prepared to handle exceptions from Itanium architecture-based and IA-32 code. Depending on the exception cause, exception vectors can be: •...
Page 848
Table 9-1. IA-32 Vectors that need Itanium® Architecture-based OS Support (Continued)
Vector (IVA offset) | Exception Name | Exception Related To | Expected OS Behavior
 | IA-32 Taken Branch trap | Debug | Relay to debugger.
 | IA-32 Single Step trap | Debug | Relay to debugger.
 | IA-32 Invalid Opcode fault | Bad Opcode | Signal application.
Page 849
making the reference has completed. Since an IA-32 instruction can make multiple memory references, a single IA-32 instruction may cause multiple data break points to trigger. Details on how this is communicated to software in the interrupt status register (ISR) are given in Section 9.1, “IA-32 Trap Code”...
Page 851
Itanium architecture can fully leverage the large set of existing platform infrastructure and I/O devices, compatibility with existing platform infrastructure is provided in the form of direct support for Intel 8259A compatible interrupt controllers and limited support for level sensitive interrupts.
Page 852
• From external sources, e.g. external interrupt controllers or intelligent external I/O devices, or • From the processor’s LINT0 or LINT1 pins (typically connected to an Intel 8259A compatible interrupt controller), or • From internal processor sources, e.g. timers or performance monitors, or •...
Page 853
the way out of an uninterruptable code section software is not required to serialize the setting of PSR.i either, unless it is of interest to software to be able to take interrupts in the very next instruction group. A code example for this case is given below: rsm i ;;...
Page 854
10.4 External Interrupt Delivery The architectural interrupt model in Section 5.8 defines how each interrupt vector cycles through one of four states: • Inactive: there is no interrupt pending on this vector. • Pending: an interrupt has been received by the processor on this vector, but has not been accepted by the processor and has not been acquired by software.
Page 855
Software must preserve IIP and IPSR prior to re-enabling PSR.ic and PSR.i which will re-enable taking of exceptions and higher priority external interrupts. d. Issue a srlz.d instruction. This ensures that updated PSR.ic and PSR.i settings are visible, and it also makes sure that the IVR read side effect of masking lower or equal priority interrupts is visible when PSR.i becomes 1.
Page 856
10.5.1 Notation Preprocessor macros for function ENTRY and END are used in the examples to reduce duplication of code and reduce document space requirements.
#define ENTRY(label) \
    .text; \
    .align 32;; \
    .global label; \
    .proc label; \
    label::
#define END(label) .endp
10.5.2 TPR and XPTR Usage Example This code will allow certain interrupts to be masked by increasing/decreasing the task...
Page 857
10.5.3 EOI Usage Example This example is a typical return from an interrupt service routine to the generic interrupt handler. Interrupts are disabled before returning to the main trap handler in preparation for returning from kernel space.
return_from_interrupt:
    // disable interrupts here
    rsm 0x4000 // make sure interrupts disabled
    // interrupt_eoi# clear the sapic/pic interrupt...
Page 858
The Interval Time Counter (ITC) gets updated at a fixed relation to the processor clock. The ITM, Interval Timer Match, is used to determine when an interval timer interrupt is generated. When the ITC matches the ITM and the timer is unmasked via ITV, an interrupt will be generated.
Page 859
the time-out value. In this case the ITM has to be adjusted in order for the next ITM to be accurate. The following algorithm could be used to adjust the next ITM before returning from the timer interrupt handler. for (;;) { itm_next = itm_next + timeout_delta + (read current ITC - read current ITM);...
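The idea of skipping over missed periods can also be expressed in closed form. The sketch below is a simplified variant, not the algorithm quoted above: it keeps timer ticks on a fixed grid of `timeout_delta` and returns the first grid point strictly after the current ITC value. The function name is hypothetical, and 64-bit counter wraparound is ignored.

```python
def next_itm(itm_prev, itc_now, timeout_delta):
    """Return the smallest itm_prev + k * timeout_delta (k >= 1) beyond itc_now."""
    if timeout_delta <= 0:
        raise ValueError("timeout_delta must be positive")
    # Number of whole periods already missed while the handler was delayed.
    missed = max(0, (itc_now - itm_prev) // timeout_delta)
    return itm_prev + (missed + 1) * timeout_delta
```

If the handler ran late by two and a half periods, the value returned is three periods past the previous match, so the next interrupt is not programmed into the past.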
Page 860
10.5.9 INTA Example External interrupt controllers that are compatible with the Intel 8259A interrupt controller cannot issue interrupt messages, so the vector number is not available at the time of the interrupt request. When an interrupt is accepted, software must read the vector number (via IVR) and check whether it is the ExtINT vector, which indicates that the interrupt came from an external controller.
Page 861
// A single byte load from the INTA address should cause
// the processor to emit the INTA cycle on the processor
// system bus. Any Intel 8259A compatible external interrupt
// controller must respond with the actual interrupt
// vector number as the data to be loaded.
Page 863
I/O Architecture I/O devices can be accessed from Itanium architecture-based programs using regular loads and stores to uncacheable space. While cacheable Itanium memory references may be reordered by the processor, uncacheable I/O references are always presented to the platform in program order. This “sequentiality” of uncacheable references is discussed in Section 2.2.2, “Memory Attributes”...
Page 864
The mf.a instruction on the other hand ensures that all prior data memory references made by the processor issuing the mf.a have been “accepted” by the external platform. However by itself the mf.a does not guarantee that all cache coherent agents have observed all prior memory operations.
Page 865
As a result of the spreading-out of the I/O ports into individual 4KB pages, Itanium architecture-based operating system code can control IA-32 IN, OUT instruction and IA-32 or Itanium load/store accessibility to blocks of 4 virtual I/O ports using the TLBs. This allows Itanium architecture-based operating systems to securely map devices that inhabit the I/O port space to different Itanium architecture-based device drivers or to user-space Itanium architecture-based applications.
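The spreading-out of ports can be sketched as an address computation. This is an illustrative model, assuming the mapping `IOBase | (port{15:2} << 12) | port{11:0}` for the 64K-port I/O space; the function name and the `io_base` values are hypothetical.

```python
def io_port_virtual_address(io_base, port):
    """Map a 16-bit I/O port number to its address in the spread-out I/O block."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must be a 16-bit value")
    # Bits 15:2 of the port select the 4KB page; bits 11:0 give the offset within it.
    return io_base | ((port >> 2) << 12) | (port & 0xFFF)
```

Under this mapping, ports 0x3F8 through 0x3FB share one 4KB page while port 0x3FC lands on the next page, which is what lets the TLBs grant or deny access in blocks of four ports.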
Page 867
Performance Monitoring Support Processors based on the Itanium architecture include a minimum of four performance counters which can be programmed to count processor events. These event counts can be used to analyze both hardware and software performance. Performance counters can be configured to generate a counter overflow interrupt. This interrupt can be used for event- or time-based profiling.
Page 868
The PAL firmware provides information about the performance monitor registers that are implemented on the processor through the PAL_PERF_MON_INFO PAL call. Information provided by the PAL includes bit masks which indicate which PMC/PMD registers are implemented on this processor model, as well as the implemented number of generic PMC/PMD pairs, and the counter width of the generic counters.
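The reported counter width matters when a counter is preset for event-based sampling. The sketch below is a minimal illustration, assuming a generic counter that wraps (and signals overflow) when it counts past 2^width - 1; the function name and the widths used in the example are hypothetical values, not ones returned by any particular processor.

```python
def overflow_preset(counter_width, events_until_overflow):
    """Initial counter value that overflows after the requested number of events."""
    limit = 1 << counter_width
    if not 0 < events_until_overflow <= limit:
        raise ValueError("sampling period does not fit in the counter")
    return (limit - events_until_overflow) % limit
```

A profiler that wants one sample per 1000 events on a 32-bit counter would write `overflow_preset(32, 1000)` to the counter each time the overflow interrupt is serviced.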
Page 869
model-specific processor monitoring capabilities, and is a well-defined isolated and easily replaceable software component. The following operating system services allow a kernel mode device driver to take full advantage of the performance monitors: • Allocation/Free Performance monitors – operating system should delegate management of the performance monitor resources to device driver.
Page 871
Section 1.2, “Related Documents” on page 2:505. The PAL layer is developed by Intel Corporation and delivered with the processor. The SAL, UEFI and ACPI firmware is developed by the platform manufacturer and provides a means of supporting value added platform features from different vendors.
Page 872
The order of steps within the UEFI/SAL firmware is platform implementation dependent and may vary. In general, the UEFI/SAL firmware selects a Bootstrap processor (BSP) in multiprocessor (MP) configurations early in the boot sequence. Next, UEFI/SAL will find and initialize memory and invoke PAL procedures to conduct additional processor tests to ensure the health of the processors.
Page 873
The UEFI Boot Manager displays the list of operating system choices and permits the user to select the operating system for booting. To support this functionality, the OS setup program stores the boot paths of the OS loaders and boot options in non-volatile storage managed by the UEFI firmware.
Page 874
Figure 13-2. Control Flow of Boot Process in a Multiprocessor Configuration (flowchart; box labels include: Power On, Firmware Recovery?, PALE_RESET, SALE_ENTRY, SAL_RESET, BSP Selection, Rendezvous_1, PAL Late Self-test & Memory Test, Rendezvous_2, Wake APs for PAL Late Self-test...)
Page 875
The register stack should be invalidated. This can be done by setting the Register Stack Configuration Register (RSC) to zero followed by a loadrs instruction. Setting the RSC to zero will put the register stack in enforced lazy mode and set the RSC.loadrs, load distance to tear point, to zero.
Page 876
Before enabling virtual addressing, the Interruption Instruction Bundle Pointer (IIP) is set to point to a virtual address. This is done so that when the return from interruption instruction (rfi) is executed, the instruction fetched will have a virtual address. The rfi will switch modes based on IPSR values which are moved into the PSR.
Page 877
GetFeaturesCall:
    mov r14 = ip                        // Get the ip of the current bundle
    movl r28 = PAL_PROC_GET_FEATURES    // Index of the PAL procedure
    movl r4 = AddressOfPALProc;;        // Address of the PAL proc entry point
    ld8 r4 = [r4];;                     // Read address from local pointer
    mov b5 = r4                         // Move address into a branch register
    // Compute the return address in a position independent manner...
Page 878
    movl r4 = AddressOfPALProc;;        // Address of the PAL proc entry point
    ld8 r4 = [r4];;                     // Read address from local pointer
    mov b5 = r4                         // Move address into a branch register
    // Make the PAL_HALT_INFO procedure call. PAL_HALT_INFO uses stacked register
    // convention and parameters are passed with in0-in3
    mov r28 = PAL_HALT_INFO;;           // Index of the PAL procedure...
Page 879
the EfiExitBootServices() procedure. After this call, UEFI boot services may no longer be invoked by the OS. The UEFI runtime services execute in physical mode until the OS invokes the EFISetVirtualAddress() function to switch the UEFI to virtual mode. After this point, the UEFI runtime services may be invoked in virtual mode only.
Page 880
In general, if SAL needs to invoke a PAL procedure, it will do so in the same addressing mode in which it was called by the OS (i.e. without changing the PSR.dt, PSR.rt, and PSR.it bits). If a particular PAL procedure can only be invoked in physical mode, SAL will turn off translations and then invoke the PAL procedure.
Page 881
Figure 13-3. Correctable Machine Check Code Flow (flowchart; labels: PAL_MC_RESUME, OS_MCA, SAL_CHECK, PAL_CHECK, Log Error, Interrupt, Return to Execution Context)
Figure 13-4. Uncorrectable Machine Check Code Flow (flowchart; labels: OS_MCA, PAL_CHECK, SAL_CHECK, Correct/Log Error)
For multiprocessor systems, machine checks are classified as local and global. A global MCA implies a system wide broadcast by hardware of an error condition.
Page 882
• Attempt to contain the error by requesting a rendezvous for all processors in the system if needed. • Hand off control to SAL for further processing, such as error logging. • Return processor error log information upon request by SAL. •...
Page 883
When an uncorrected machine check event occurs, SAL will invoke the OS_MCA handler. The functionality of this handler is dependent on the OS. At a minimum, it must call a SAL procedure to retrieve the error logging and state information and then call another SAL procedure to release these resources for future error logging and state save.
Page 884
Figure 13-5. INIT Flow (flowchart; labels: PAL_INIT, INIT Event, SAL_INIT, Write processor / platform info to save area, INIT due to failure to respond to rendezvous interrupt?, SAL_MC_RENDEZ, Wake up Interrupt, OS_INIT procedures valid?, OS_INIT, Return value from OS, Warm boot, Return to Interrupted Context, SAL implementation-specific...)
Page 885
13.3.3 PMI Flows Processors based on the Itanium architecture implement the Platform Management Interrupt (PMI) to enable platform developers to provide high level system functions, such as power management and security, in a manner that is transparent not only to the application software but also to the operating system.
Page 886
than the performance_index returned by PAL_GET_PSTATE, the caller responds by transitioning the processor to a lower performance P-state, which consumes less power and operates at reduced performance.
Figure 13-6. Flowchart Showing P-state Feedback Policy (flowchart; steps: (1) getperfindex = PAL_GET_PSTATE; (2) OS computes newpstate index from busy ratio and getperfindex; decision: newpstate == getperfindex?; other label: Reset)
Page 887
Code Examples OS Boot Flow Sample Code The sample code given below is an example of setting up operating system register state to prepare the processor for running in virtual mode as described in Section 13.1.2, “Operating System Boot Steps” on page 2:625.
Page 888
(p6)br.cond.sptk.few.clr Loader_RRLoop
// Disable the VHPT walker and set up the minimum size for it (32K) by writing
// to the page table address register (cr.pta)
mov r2 = (15<<2)
mov cr.pta = r2
// Initialize the protection key registers for kernel
mov r2 = (1<<...
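The constant `(15<<2)` written to cr.pta can be unpacked with a small encoder. This is a minimal sketch, assuming the PTA layout places the walker enable bit (ve) at bit 0, the log2 table size at bits 7:2, the format bit (vf) at bit 8, and the base address from bit 15 upward; the function name and parameter names are hypothetical.

```python
def encode_pta(size_log2, walker_enabled=False, long_format=False, base=0):
    """Pack the assumed PTA fields into a 64-bit register value."""
    if not 15 <= size_log2 < 64:
        raise ValueError("VHPT size must be at least 2**15 bytes")
    if base & ((1 << 15) - 1):
        raise ValueError("base must be aligned to 32K")
    return base | (int(long_format) << 8) | (size_log2 << 2) | int(walker_enabled)
```

With `size_log2=15` and the walker disabled this yields 60, the same value as `(15<<2)` in the sample code, and 2^15 bytes is the 32K minimum mentioned in its comment.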
Page 889
The Translation Insertion Format looks like the following... Below is the register interface to insert entries into the TLB
//1) A general register contains an address,attributes,and permissions
//2) ITIR: additional info such as protection key page size info
//3) IFA: specifies the virtual page number for instruction and data TLB inserts
//Registers used:
//---------------...
Page 890
movl r2 = 0x0 // use vpn 0
mov cr.ifa = r2
//Setup ITIR (Interruption TLB Insertion Register)
movl r3 = ( ( 24 << 2 ) | ( 0 << 8 ) ) // 16 MB
mov cr.itir = r3
//Now setup the general register to use with itr (insert translation
//register)
movl r10 =( (1 <<...
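The ITIR constant in the sample can be checked with a short encoder. This is a sketch under the field positions implied by the sample's shifts: page size (as a power of two) at bits 7:2 and protection key starting at bit 8. The 24-bit key width and the function names are assumptions.

```python
def encode_itir(page_size_log2, key=0):
    """Pack page size and protection key the way the sample's shifts suggest."""
    if not 0 <= page_size_log2 < 64:
        raise ValueError("page size exponent must fit in 6 bits")
    if not 0 <= key < (1 << 24):          # assumed key width
        raise ValueError("key out of range")
    return (key << 8) | (page_size_log2 << 2)


def page_size_bytes(page_size_log2):
    return 1 << page_size_log2
```

`encode_itir(24, 0)` reproduces the value loaded into r3 above, and `page_size_bytes(24)` confirms the 16 MB noted in its comment.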
Page 895
Function of getf.sig ............3:143 ® ® Intel Itanium Architecture Software Developer’s Manual, Rev. 2.3...
Page 904
These resources include instructions and registers. Itanium Architecture – The new ISA with 64-bit instruction capabilities, new performance-enhancing features, and support for the IA-32 instruction set. IA-32 Architecture – The 32-bit and 16-bit Intel architecture as described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
Page 905
• Intel® 64 and IA-32 Architectures Software Developer’s Manual – This set of manuals describes the Intel 32-bit architecture. They are available from the Intel Literature Department by calling 1-800-548-4725 and requesting Document Numbers 243190, 243191 and 243192...
Page 906
Date of Revision Description Revision Number August 2005 Allow register fields in CR.LID register to be read-only and CR.LID checking on interruption messages by processors optional. See Vol 2, Part I, Ch 5 “Interruptions” and Section 11.2.2 PALE_RESET Exit State for details. Relaxed reserved and ignored fields checkings in IA-32 application registers in Vol 1 Ch 6 and Vol 2, Part I, Ch 10.
Page 907
Date of Revision Description Revision Number August 2002 Added Predicate Behavior of alloc Instruction Clarification (Section 4.1.2, Part I, Volume 1; Section 2.2, Part I, Volume 3). Added New fc.i Instruction (Section 4.4.6.1, and 4.4.6.2, Part I, Volume 1; Section 4.3.3, 4.4.1, 4.4.5, 4.4.6, 4.4.7, 5.5.2, and 7.1.2, Part I, Volume 2; Section 2.5, 2.5.1, 2.5.2, 2.5.3, and 4.5.2.1, Part II, Volume 2;...
Page 908
Date of Revision Description Revision Number Volume 2: Class pr-writers-int clarification (Table A-5). PAL_MC_DRAIN clarification (Section 4.4.6.1). VHPT walk and forward progress change (Section 4.1.1.2). IA-32 IBR/DBR match clarification (Section 7.1.1). ISR figure changes (pp. 8-5, 8-26, 8-33 and 8-36). PAL_CACHE_FLUSH return argument change –...
Page 909
Date of Revision Description Revision Number Volume 2: Clarifications regarding “reserved” fields in ITIR (Chapter 3). Instruction and Data translation must be enabled for executing IA-32 instructions (Chapters 3,4 and 10). FCR/FDR mappings, and clarification to the value of PSR.ri after an RFI (Chapters 3 and 4).
Page 910
Instruction Reference This chapter describes the function of each Itanium instruction. The pages of this chapter are sorted alphabetically by assembly language mnemonic. Instruction Page Conventions The instruction pages are divided into multiple sections as listed in Table 2-1. The first three sections are present on all instruction pages.
Page 911
(64 bits, not including the NaT bit) where the notation GR[addr] is used. The syntactical differences between the code found in the Operation section and ANSI C are listed in Table 2-4.
Table 2-3. Register File Notation
Register File | C Notation | Assembly Mnemonic | Indirect Access
...
Page 912
Table 2-5. Pervasive Conditions Not Included in Instruction Description Code
Condition | Action
Read of a register outside the current frame. | An undefined value is returned (no fault).
Access to a banked general register (GR 16 through GR 31). | The GR bank specified by PSR.bn is accessed.
PSR.ss is set. | ...
Page 913
add — Add
Format:
(qp) add r1 = r2, r3        register_form
(qp) add r1 = r2, r3, 1     plus1_form, register_form
(qp) add r1 = imm, r3       pseudo-op
(qp) adds r1 = imm14, r3    imm14_form
(qp) addl r1 = imm22, r3    imm22_form
Description: The two source operands (and an optional constant 1) are added and the result placed in GR r1. In the register form the first operand is GR r2;...
Page 914
addp4 — Add Pointer
Format:
(qp) addp4 r1 = r2, r3       register_form
(qp) addp4 r1 = imm14, r3    imm14_form
Description: The two source operands are added. The upper 32 bits of the result are forced to zero, and then bits {31:30} of GR r3 are copied to bits {62:61} of the result. This result is placed in GR r1....
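The bit manipulation described for addp4 can be illustrated with a minimal Python sketch. It assumes that the operand supplying bits {31:30} is the second source (the 32-bit pointer), and it ignores NaT propagation and predication. The function and parameter names are hypothetical.

```python
MASK32 = 0xFFFF_FFFF

def addp4(src1: int, pointer: int) -> int:
    """Model the addp4 result for two integer operands (NaT handling omitted)."""
    total = (src1 + pointer) & MASK32   # add, then force the upper 32 bits to zero
    region = (pointer >> 30) & 0x3      # bits {31:30} of the pointer operand (assumed source)
    return total | (region << 61)       # copy them into bits {62:61} of the result
```

For example, `addp4(0x10, 0xC000_0000)` keeps the 32-bit sum `0xC000_0010` in the low half and sets bits 62 and 61.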
Page 915
alloc — Allocate Stack Frame
Format: (qp) alloc r1 = ar.pfs, i, l, o, r
Description: A new stack frame is allocated on the general register stack, and the Previous Function State register (PFS) is copied to GR r1. The change of frame size is immediate. The write of GR r1 and subsequent instructions in the same instruction group use the new frame....
Page 916
alloc Operation:
    // tmp_sof, tmp_sol, tmp_sor are the fields encoded in the instruction
    tmp_sof = i + l + o;
    tmp_sol = i + l;
    tmp_sor = r u>> 3;
    check_target_register_sof(r1, tmp_sof);
    if (tmp_sof u> 96 || r u> tmp_sof || tmp_sol u> tmp_sof || qp != 0)
        illegal_operation_fault();...
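The frame-field arithmetic in the alloc pseudo-code can be tried out with a small Python sketch. It covers only the size computation and the range checks shown above. The target-register check and the `qp` test are left out, and a Python exception stands in for the Illegal Operation fault. The function name is hypothetical.

```python
def alloc_frame_fields(i: int, l: int, o: int, r: int) -> tuple[int, int, int]:
    """Return (sof, sol, sor) for the requested input, local, output and rotating counts."""
    sof = i + l + o      # size of frame
    sol = i + l          # size of locals (inputs plus locals)
    sor = r >> 3         # size of rotating region, in units of 8 registers
    if sof > 96 or r > sof or sol > sof:
        # Stand-in for illegal_operation_fault() in the pseudo-code
        raise ValueError("illegal operation: invalid frame sizes")
    return sof, sol, sor
```

A request for 2 inputs, 3 locals, 4 outputs and 8 rotating registers yields a 9-register frame with 5 locals and a rotating size field of 1.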
Page 917
and — Logical And
Format:
(qp) and r1 = r2, r3      register_form
(qp) and r1 = imm8, r3    imm8_form
Description: The two source operands are logically ANDed and the result placed in GR r1. In the register_form the first operand is GR r2; in the imm8_form the first operand is taken from the imm8 encoding field....
Page 918
andcm — And Complement
Format:
(qp) andcm r1 = r2, r3      register_form
(qp) andcm r1 = imm8, r3    imm8_form
Description: The first source operand is logically ANDed with the 1’s complement of the second source operand and the result placed in GR r1. In the register_form the first operand is GR r2;...
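As a quick illustration of the And Complement operation, the following Python sketch computes the 64-bit result for two integer operands. NaT handling and predication are not modelled, and the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1

def andcm(first: int, second: int) -> int:
    """AND the first operand with the 1's complement of the second, within 64 bits."""
    return first & ~second & MASK64
```

The operation clears in `first` every bit that is set in `second`.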
Page 919
br — Branch
Format:
(qp) br.btype.bwh.ph.dh target25         ip_relative_form
(qp) br.btype.bwh.ph.dh b1 = target25    call_form, ip_relative_form
br.btype.bwh.ph.dh target25              counted_form, ip_relative_form
br.ph.dh target25                        pseudo-op
(qp) br.btype.bwh.ph.dh b2               indirect_form
(qp) br.btype.bwh.ph.dh b1 = b2          call_form, indirect_form
br.ph.dh b2                              pseudo-op
Description: A branch condition is evaluated, and either a branch is taken, or execution continues...
Page 920
the branch condition is simply the value of the specified predicate register. These basic branch types are:
• cond: If the qualifying predicate is 1, the branch is taken. Otherwise it is not taken.
• call: If the qualifying predicate is 1, the branch is taken and several other actions occur:
•...
Page 921
group as br.ia are not allowed, since br.ia may implicitly read all ARs. If an illegal RAW dependency is present between an AR write and br.ia, the first IA-32 instruction fetch and execution may or may not see the updated AR value. IA-32 instruction set execution leaves the contents of the ALAT undefined....
Page 922
The modulo-scheduled loop types are:
• ctop and cexit: These branch types behave identically, except in the determination of whether to branch or not. For br.ctop, the branch is taken if either LC is non-zero or EC is greater than one. For br.cexit, the opposite is true. It is not taken if either LC is non-zero or EC is greater than one and is taken otherwise....
Page 924
Table 2-7. Branch Whether Hint
bwh Completer | Branch Whether Hint
spnt | Static Not-Taken
sptk | Static Taken
dpnt | Dynamic Not-Taken
dptk | Dynamic Taken
Table 2-8. Sequential Prefetch Hint
ph Completer | Sequential Prefetch Hint
few or none | Few lines
many | Many lines
Table 2-9....
Page 925
tmp_taken = PR[qp];
if (tmp_taken) {
    // tmp_growth indicates the amount to move logical TOP *up*:
    // tmp_growth = sizeof(previous out) - sizeof(current frame)
    // a negative amount indicates a shrinking stack
    tmp_growth = (AR[PFS].pfm.sof - AR[PFS].pfm.sol) - CFM.sof;
    alat_frame_update(-AR[PFS].pfm.sol, 0);
    rse_fatal = rse_restore_frame(AR[PFS].pfm.sol, tmp_growth, CFM.sof);...
Page 926
        illegal_operation_fault();
    tmp_taken = (AR[LC] != 0);
    if (AR[LC] != 0)
        AR[LC]--;
    break;
case ‘ctop’:
case ‘cexit’: // SW pipelined counted loop
    if (slot != 2)
        illegal_operation_fault();
    if (btype == ‘ctop’) tmp_taken = ((AR[LC] != 0) || (AR[EC] u> 1));
    if (btype == ‘cexit’) tmp_taken = !((AR[LC] != 0) || (AR[EC] u> 1));
    if (AR[LC] != 0) {
        AR[LC]--;...
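The taken or not-taken decision in the pseudo-code above can be restated as a short Python sketch. It covers only the decision for the counted-loop and modulo-scheduled branch types whose conditions are visible in this excerpt. The updates to LC, EC and the rotating registers are not modelled, and the function name is hypothetical.

```python
def branch_taken(btype: str, lc: int, ec: int) -> bool:
    """Decide whether a loop-type branch is taken, given the LC and EC values."""
    if btype == "cloop":
        return lc != 0                    # counted loop: taken while LC is non-zero
    keep_looping = lc != 0 or ec > 1      # kernel or epilog iterations remain
    if btype == "ctop":
        return keep_looping               # branch at the bottom of the loop
    if btype == "cexit":
        return not keep_looping           # opposite sense: branch out of the loop
    raise ValueError(f"branch type not covered by this sketch: {btype}")
```

With LC at zero, `ctop` keeps branching while EC is above one, which is the epilog phase of a software-pipelined loop.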
Page 927
taken_branch = 1;
IP = tmp_IP; // set the new value for IP
if (!impl_uia_fault_supported() &&
    ((PSR.it && unimplemented_virtual_address(tmp_IP, PSR.vm))
    || (!PSR.it && unimplemented_physical_address(tmp_IP))))
    unimplemented_instruction_address_trap(lower_priv_transition, tmp_IP);
if (lower_priv_transition && PSR.lp)
    lower_privilege_transfer_trap();
if (PSR.tb)
    taken_branch_trap();
Interruptions:
Illegal Operation fault
Lower-Privilege Transfer trap
Disabled Instruction Set Transition fault
Taken Branch trap...
Page 928
break — Break
Format:
(qp) break imm21      pseudo-op
(qp) break.i imm21    i_unit_form
(qp) break.b imm21    b_unit_form
(qp) break.m imm21    m_unit_form
(qp) break.f imm21    f_unit_form
(qp) break.x imm62    x_unit_form
Description: A Break Instruction fault is taken. For the i_unit_form, f_unit_form and m_unit_form, the value specified by imm21 is zero-extended and placed in the Interruption Immediate control register (IIM)....
Page 929
brl — Branch Long
Format:
(qp) brl.btype.bwh.ph.dh target64         
(qp) brl.btype.bwh.ph.dh b1 = target64    call_form
brl.ph.dh target64                        pseudo-op
Description: A branch condition is evaluated, and either a branch is taken, or execution continues with the next sequential instruction. The execution of a branch logically follows the execution of all previous non-branch instructions in the same instruction group....
Page 930
system is required to provide an Illegal Operation fault handler which emulates taken and not-taken long branches. Presence of this instruction is indicated by a 1 in the lb bit of CPUID register 4. See Section 3.1.11, “Processor Identification Registers” on page 1:34.
Page 931
brp — Branch Predict
Format:
brp.ipwh.ih target25, tag13     ip_relative_form
brp.indwh.ih b2, tag13          indirect_form
brp.ret.indwh.ih b2, tag13      return_form, indirect_form
Description: This instruction can be used to provide to hardware early information about a future branch. It has no effect on architectural machine state, and operates as a nop instruction except for its performance effects....
Page 933
bsw — Bank Switch
Format:
bsw.0    zero_form
bsw.1    one_form
Description: This instruction switches to the specified register bank. The zero_form specifies Bank 0 for GR16 to GR31. The one_form specifies Bank 1 for GR16 to GR31. After the bank switch the previous register bank is no longer accessible but does retain its current state....
Page 934
chk — Speculation Check
Format:
(qp) chk.s r2, target25         pseudo-op
(qp) chk.s.i r2, target25       control_form, i_unit_form, gr_form
(qp) chk.s.m r2, target25       control_form, m_unit_form, gr_form
(qp) chk.s f2, target25         control_form, fr_form
(qp) chk.a.aclr r1, target25    data_form, gr_form
(qp) chk.a.aclr f1, target25    data_form, fr_form
Description: The result of a control- or data-speculative calculation is checked for success or failure....
Page 936
clrrrb — Clear RRB
Format:
clrrrb       all_form
clrrrb.pr    pred_form
Description: In the all_form, the register rename base registers (CFM.rrb.gr, CFM.rrb.fr, and CFM.rrb.pr) are cleared. In the pred_form, the single register rename base register for the predicates (CFM.rrb.pr) is cleared. This instruction must be the last instruction in an instruction group;...
Page 937
clz — Count Leading Zeros
Format: (qp) clz r1 = r3
Description: The number of leading zeros in GR r3 is placed in GR r1. An Illegal Operation fault is raised on processor models that do not support the instruction. CPUID register 4 indicates the presence of the feature on the processor model....
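For reference, counting leading zeros of a 64-bit value can be sketched in Python as follows. This models only the arithmetic result. The fault behaviour and NaT handling are not represented, and the function name is hypothetical.

```python
def clz64(value: int) -> int:
    """Count the zero bits above the most significant set bit of a 64-bit value."""
    value &= (1 << 64) - 1            # restrict to the 64-bit register width
    return 64 - value.bit_length()    # bit_length() is 0 for zero, giving 64
```

An all-zero input has 64 leading zeros under this definition.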
Page 938
cmp — Compare
Format:
(qp) cmp.crel.ctype p1, p2 = r2, r3      register_form
(qp) cmp.crel.ctype p1, p2 = imm8, r3    imm8_form
(qp) cmp.crel.ctype p1, p2 = r0, r3      parallel_inequality_form
(qp) cmp.crel.ctype p1, p2 = r3, r0      pseudo-op
Description: The two source operands are compared for one of ten relations specified by crel. This produces a boolean result which is 1 if the comparison condition is true, and 0 otherwise....
Page 939
simply uses the negative relation with an implemented type. The implemented relations and how the pseudo-ops map onto them are shown in Table 2-16 (for normal and unc type compares), and Table 2-17 (for parallel type compares). Table 2-16. 64-bit Comparison Relations for Normal and unc Compares Compare Relation Register Form is a Immediate Form is a...
Page 940
Operation:
if (PR[qp]) {
    if (p1 == p2)
        illegal_operation_fault();
    tmp_nat = (register_form ? GR[r2].nat : 0) || GR[r3].nat;
    if (register_form)
        tmp_src = GR[r2];
    else if (imm8_form)
        tmp_src = sign_ext(imm8, 8);
    else // parallel_inequality_form
        tmp_src = 0;
    if (crel == ‘eq’) tmp_rel = tmp_src == GR[r3];
    else if (crel == ‘ne’) tmp_rel = tmp_src != GR[r3];...
Page 942
cmp4 — Compare 4 Bytes
Format:
(qp) cmp4.crel.ctype p1, p2 = r2, r3      register_form
(qp) cmp4.crel.ctype p1, p2 = imm8, r3    imm8_form
(qp) cmp4.crel.ctype p1, p2 = r0, r3      parallel_inequality_form
(qp) cmp4.crel.ctype p1, p2 = r3, r0      pseudo-op
Description: The least significant 32 bits from each of two source operands are compared for one of ten relations specified by crel....
Page 945
cmpxchg — Compare and Exchange
Format:
(qp) cmpxchgsz.sem.ldhint r1 = [r3], r2, ar.ccv             
(qp) cmp8xchg16.sem.ldhint r1 = [r3], r2, ar.csd, ar.ccv    sixteen_byte_form
Description: A value consisting of sz bytes (8 bytes for cmp8xchg16) is read from memory starting at the address specified by the value in GR r3....
Page 946
affect program functionality and may be ignored by the implementation. See Section 4.4.6, “Memory Hierarchy Control and Consistency” on page 1:69 for details. For cmp8xchg16, an Illegal Operation fault is raised on processor models that do not support the instruction. CPUID register 4 indicates the presence of the feature on the processor model....
Page 947
cover — Cover Stack Frame
Format: cover
Description: A new stack frame of zero size is allocated which does not include any registers from the previous frame (as though all output registers in the previous frame had been locals). The register rename base registers are reset. If interruption collection is disabled (PSR.ic is zero), then the old value of the Current Frame Marker (CFM) is copied to the Interruption Function State register (IFS), and IFS.v is set to one....
Page 948
czx — Compute Zero Index
Format:
(qp) czx1.l r1 = r3    one_byte_form, left_form
(qp) czx1.r r1 = r3    one_byte_form, right_form
(qp) czx2.l r1 = r3    two_byte_form, left_form
(qp) czx2.r r1 = r3    two_byte_form, right_form
Description: GR r3 is scanned for a zero element. The element is either an 8-bit aligned byte (one_byte_form) or a 16-bit aligned pair of bytes (two_byte_form). The index of the first zero element is placed in GR r1....
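The scan for a zero element can be sketched in Python. The excerpt above is cut off before it defines the scan direction and the no-zero case, so this sketch rests on two assumptions: the left form scans from the most significant element (index 0) downward and the right form from the least significant element upward, and when no zero element exists the result is the element count (8 for bytes, 4 for byte pairs). The function name and parameters are hypothetical.

```python
def czx(value: int, elem_bits: int, left: bool) -> int:
    """Return the index of the first zero element in a 64-bit value."""
    count = 64 // elem_bits                      # 8 elements for bytes, 4 for byte pairs
    mask = (1 << elem_bits) - 1
    for index in range(count):
        if left:
            shift = 64 - elem_bits * (index + 1)  # index 0 is the most significant element
        else:
            shift = elem_bits * index             # index 0 is the least significant element
        if (value >> shift) & mask == 0:
            return index
    return count                                  # assumed default when no element is zero
```

This kind of scan is typically used to find a string terminator inside a word that was loaded in one memory access.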
Page 950
dep — Deposit
Format:
(qp) dep r1 = r2, r3, pos6, len4      merge_form, register_form
(qp) dep r1 = imm1, r3, pos6, len6    merge_form, imm_form
(qp) dep.z r1 = r2, pos6, len6        zero_form, register_form
(qp) dep.z r1 = imm8, pos6, len6      zero_form, imm_form
Description: In the merge_form, a right justified bit field taken from the first source operand is deposited into the value in GR r3 at an arbitrary bit position and the result is placed in GR r1....
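The merge form of the deposit operation can be illustrated with a minimal Python sketch on plain 64-bit integers. It assumes that any part of the field extending past bit 63 is simply dropped, and it leaves out NaT handling and the encoding limits on the position and length operands. The function name is hypothetical.

```python
MASK64 = (1 << 64) - 1

def dep(field_src: int, target: int, pos: int, length: int) -> int:
    """Deposit the low `length` bits of field_src into target at bit position pos."""
    mask = (((1 << length) - 1) << pos) & MASK64     # bits of target to be replaced
    return (target & ~mask & MASK64) | ((field_src << pos) & mask)
```

The zero form corresponds to calling `dep(field_src, 0, pos, length)`, so that every bit outside the deposited field is zero.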
Page 952
epc — Enter Privileged Code
Format: epc
Description: This instruction increases the privilege level. The new privilege level is given by the TLB entry for the page containing this instruction. This instruction can be used to implement calls to higher-privileged routines without the overhead of an interruption. Before increasing the privilege level, a check is performed....
Page 953
extr — Extract
Format:
(qp) extr r1 = r3, pos6, len6      signed_form
(qp) extr.u r1 = r3, pos6, len6    unsigned_form
Description: A field is extracted from GR r3, either zero extended or sign extended, and placed right-justified in GR r1. The field begins at the bit position given by the second operand and extends len6 bits to the left....
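Field extraction with zero or sign extension can be sketched in Python as follows. The sketch assumes a field length of at least one bit, and it assumes that a field reaching past bit 63 is cut off there, with the extension taken from bit 63. NaT handling is omitted and the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1

def extr(value: int, pos: int, length: int, signed: bool) -> int:
    """Extract `length` bits of value starting at bit pos, right-justified in 64 bits."""
    length = min(length, 64 - pos)                 # assumed clamp at the register's top bit
    field_mask = (1 << length) - 1
    field = (value >> pos) & field_mask
    if signed and (field >> (length - 1)) & 1:     # top bit of the field is the sign
        field |= MASK64 & ~field_mask              # replicate it into the upper bits
    return field
```

The unsigned form leaves the upper bits zero, while the signed form fills them with copies of the field's top bit.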
Page 954
fabs — Floating-point Absolute Value
Format: (qp) fabs f1 = f3    pseudo-op of: (qp) fmerge.s f1 = f0, f3
Description: The absolute value of the value in FR f3 is computed and placed in FR f1. If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.
Operation: See “fmerge —...
Page 955
fadd — Floating-point Add
Format: (qp) fadd.pc.sf f1 = f3, f2    pseudo-op of: (qp) fma.pc.sf f1 = f3, f1, f2
Description: FR f3 and FR f2 are added (computed to infinite precision), rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1....
Page 956
famax — Floating-point Absolute Maximum
Format: (qp) famax.sf f1 = f2, f3
Description: The operand with the larger absolute value is placed in FR f1. If the magnitude of FR f2 equals the magnitude of FR f3, FR f1 gets FR f3. If either FR f2 or FR f3 is a NaN, FR f1 gets FR f3...
Page 957
famin — Floating-point Absolute Minimum
Format: (qp) famin.sf f1 = f2, f3
Description: The operand with the smaller absolute value is placed in FR f1. If the magnitude of FR f2 equals the magnitude of FR f3, FR f1 gets FR f3. If either FR f2 or FR f3 is a NaN, FR f1 gets FR f3...
Page 958
fand — Floating-point Logical And
Format: (qp) fand f1 = f2, f3
Description: The bit-wise logical AND of the significand fields of FR f2 and FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1...
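The effect of fand on the register fields can be sketched in Python by modelling a floating-point register as a (sign, exponent, significand) tuple. That tuple layout is purely illustrative. The sketch also assumes the sign field is set to positive (0), which the truncated sentence above does not show, and it ignores NaTVal handling.

```python
MASK64 = (1 << 64) - 1
BIASED_EXP_2_63 = 0x1003E   # 0xFFFF bias + 63, the constant named in the description

def fand(significand2: int, significand3: int) -> tuple[int, int, int]:
    """Return (sign, exponent, significand) for the AND of two 64-bit significands."""
    return (0, BIASED_EXP_2_63, significand2 & significand3 & MASK64)
```

Fixing the exponent at 0x1003E means the 64-bit significand reads as a plain integer when the register is interpreted as a number.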
Page 959
fandcm — Floating-point And Complement
Format: (qp) fandcm f1 = f2, f3
Description: The bit-wise logical AND of the significand field of FR f2 with the bit-wise complemented significand field of FR f3 is computed. The resulting value is stored in the significand field of FR f1....
Page 960
fc — Flush Cache
Format:
(qp) fc r3      invalidate_line_form
(qp) fc.i r3    instruction_cache_coherent_form
Description: In the invalidate_line form, the cache line associated with the address specified by the value of GR r3 is invalidated from all levels of the processor cache hierarchy. The invalidation is broadcast throughout the coherence domain....
Page 961
Register NaT Consumption fault Data TLB fault Interruptions: Unimplemented Data Address fault Data Page Not Present fault Data Nested TLB fault Data NaT Page Consumption fault Alternate Data TLB fault Data Access Rights fault VHPT Data fault 3:62 Volume 3: Instruction Reference...
Page 962
fchkf — Floating-point Check Flags
Format: (qp) fchkf.sf target25
Description: The flags in FPSR.sf.flags are compared with FPSR.s0.flags and FPSR.traps. If any flags set in FPSR.sf.flags correspond to FPSR.traps which are enabled, or if any flags set in FPSR.sf.flags are not set in FPSR.s0.flags, then a branch to target25 is taken....
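The branch condition described for fchkf reduces to two mask tests, sketched here in Python. To stay independent of the actual FPSR bit layout, the sketch takes the set of enabled traps as a 6-bit mask aligned with the flag bits. That parameter, like the function name, is an assumption made for illustration.

```python
FLAG_MASK = 0x3F   # six status flags per status field

def fchkf_branches(sf_flags: int, s0_flags: int, enabled_traps: int) -> bool:
    """Return True when the speculative status field holds flags that must be handled."""
    hits_enabled_trap = (sf_flags & enabled_traps & FLAG_MASK) != 0
    new_flag = (sf_flags & ~s0_flags & FLAG_MASK) != 0   # set in sf but not yet in s0
    return hits_enabled_trap or new_flag
```

When neither test fires, the speculative flags add nothing beyond what the main status field already records, and no branch is needed.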
Page 963
fclass — Floating-point Class
Format: (qp) fclass.fcrel.fctype p1, p2 = f2, fclass9
Description: The contents of FR f2 are classified according to the fclass9 completer as shown in Table 2-25. This produces a boolean result based on whether the contents of FR f2 agrees with the floating-point number format specified by fclass9, as specified by the fcrel...
Page 965
fclrf — Floating-point Clear Flags
Format: (qp) fclrf.sf
Description: The status field’s 6-bit flags field is reset to zero. The mnemonic values for sf are given in Table 2-23 on page 3:56.
Operation:
if (PR[qp]) {
    fp_set_sf_flags(sf, 0);
}
FP Exceptions: None
Interruptions: None...
Page 966
fcmp — Floating-point Compare
Format: (qp) fcmp.frel.fctype.sf p1, p2 = f2, f3
Description: The two source operands are compared for one of twelve relations specified by frel. This produces a boolean result which is 1 if the comparison condition is true, and 0 otherwise....
Page 967
fcmp Operation:
if (PR[qp]) {
    if (p1 == p2)
        illegal_operation_fault();
    if (tmp_isrcode = fp_reg_disabled(f2, f3, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);
    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
        PR[p1] = 0;
        PR[p2] = 0;
    } else {
        fcmp_exception_fault_check(f2, f3, frel, sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))
            fp_exception_fault(fp_decode_fault(tmp_fp_env));...
Page 969
fcvt.fx — Convert Floating-point to Integer
Format:
(qp) fcvt.fx.sf f1 = f2           signed_form
(qp) fcvt.fx.trunc.sf f1 = f2     signed_form, trunc_form
(qp) fcvt.fxu.sf f1 = f2          unsigned_form
(qp) fcvt.fxu.trunc.sf f1 = f2    unsigned_form, trunc_form
Description: FR f2 is treated as a register format floating-point value and converted to a signed (signed_form) or unsigned integer (unsigned_form) using either the rounding mode specified in the FPSR.sf.rc, or using Round-to-Zero if the trunc_form of the instruction is...
Page 971
fcvt.xf — Convert Signed Integer to Floating-point
Format: (qp) fcvt.xf f1 = f2
Description: The 64-bit significand of FR f2 is treated as a signed integer and its register file precision floating-point representation is placed in FR f1. If FR f2 is a NaTVal, FR f1 is set to NaTVal instead of the computed result....
Page 972
fcvt.xuf — Convert Unsigned Integer to Floating-point
Format: (qp) fcvt.xuf.pc.sf f1 = f3    pseudo-op of: (qp) fma.pc.sf f1 = f3, f1, f0
Description: FR f3 is multiplied with FR 1, rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1.
Note: Multiplying FR f3 with FR 1 (a 1.0) normalizes the canonical representation of an...
Page 973
fetchadd — Fetch and Add Immediate
Format:
(qp) fetchadd4.sem.ldhint r1 = [r3], inc3    four_byte_form
(qp) fetchadd8.sem.ldhint r1 = [r3], inc3    eight_byte_form
Description: A value consisting of four or eight bytes is read from memory starting at the address specified by the value in GR r3....
Page 974
fetchadd Operation:
if (PR[qp]) {
    check_target_register(r1);
    if (GR[r3].nat)
        register_nat_consumption_fault(SEMAPHORE);
    size = four_byte_form ? 4 : 8;
    paddr = tlb_translate(GR[r3], size, SEMAPHORE, PSR.cpl, &mattr, &tmp_unused);
    if (!ma_supports_fetchadd(mattr))
        unsupported_data_reference_fault(SEMAPHORE, GR[r3]);
    if (sem == ‘acq’)
        val = mem_xchg_add(inc3, paddr, size, UM.be, mattr, ACQUIRE, ldhint);
    else // ‘rel’...
Page 975
flushrs flushrs — Flush Register Stack flushrs Format: All stacked general registers in the dirty partition of the register stack are written to the Description: backing store before execution continues. The dirty partition contains registers from previous procedure frames that have not yet been saved to the backing store. For a description of the register stack partitions, refer to Chapter 6, “Register Stack Engine”...
Page 976
fma — Floating-point Multiply Add
Format: (qp) fma.pc.sf f1 = f3, f4, f2
Description: The product of FR f3 and FR f4 is computed to infinite precision and then FR f2 is added to this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc....
Page 978
fmax — Floating-point Maximum
Format: (qp) fmax.sf f1 = f2, f3
Description: The operand with the larger value is placed in FR f1. If FR f2 equals FR f3, FR f1 gets FR f3. If either FR f2 or FR f3 is a NaN, FR f1 gets FR f3. If either FR f2 or FR f3 is a NaTVal, FR f1...
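The selection rule for fmax can be sketched in Python on ordinary floats. The sketch assumes that the tie and NaN cases select the second source operand (FR f3), and it does not model NaTVal, status fields or exceptions. The function name is hypothetical.

```python
def fmax(first: float, second: float) -> float:
    """Pick the larger operand; ties and NaN comparisons fall through to the second."""
    # A comparison involving NaN is False in Python, so NaN inputs select `second`.
    return first if first > second else second
```

Because only a strictly greater first operand wins, the result is not symmetric: `fmax(0.0, -0.0)` returns the second operand, and a NaN in either position yields whatever the second operand holds.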
Page 979
fmerge — Floating-point Merge
Format:
(qp) fmerge.ns f1 = f2, f3    neg_sign_form
(qp) fmerge.s f1 = f2, f3     sign_form
(qp) fmerge.se f1 = f2, f3    sign_exp_form
Description: Sign, exponent and significand fields are extracted from FR f2 and FR f3, combined, and the result is placed in FR f1. For the neg_sign_form, the sign of FR f2 is negated and concatenated with the exponent and the significand of FR f3....
Page 981
fmin — Floating-point Minimum
Format: (qp) fmin.sf f1 = f2, f3
Description: The operand with the smaller value is placed in FR f1. If FR f2 equals FR f3, FR f1 gets FR f3. If either FR f2 or FR f3 is a NaN, FR f1 gets FR f3. If either FR f2 or FR f3 is a NaTVal, FR f1...
Page 982
fmix — Floating-point Mix
Format:
(qp) fmix.l f1 = f2, f3     mix_l_form
(qp) fmix.r f1 = f2, f3     mix_r_form
(qp) fmix.lr f1 = f2, f3    mix_lr_form
Description: For the mix_l_form (mix_r_form), the left (right) single precision value in FR f2 is concatenated with the left (right) single precision value in FR f3. For the mix_lr_form, the left single precision value in FR f2 is concatenated with the right single precision value in FR f3...
Page 983
fmix Operation:
if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);
    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
    } else {
        if (mix_l_form) {
            tmp_res_hi = FR[f2].significand{63:32};
            tmp_res_lo = FR[f3].significand{63:32};
        } else if (mix_r_form) {
            tmp_res_hi = FR[f2].significand{31:0};...
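The three fmix forms amount to choosing one 32-bit half from each source significand, which the following Python sketch shows on plain 64-bit integers. The form is passed as a string ("l", "r" or "lr"), which is a convenience of this sketch. NaTVal handling and the exponent and sign of the result are not modelled, and the function name is hypothetical.

```python
MASK32 = 0xFFFF_FFFF

def fmix(sig_a: int, sig_b: int, form: str) -> int:
    """Combine 32-bit halves of two 64-bit significands as fmix.l, fmix.r or fmix.lr."""
    a_left, a_right = (sig_a >> 32) & MASK32, sig_a & MASK32
    b_left, b_right = (sig_b >> 32) & MASK32, sig_b & MASK32
    if form == "l":
        hi, lo = a_left, b_left      # left half of each source
    elif form == "r":
        hi, lo = a_right, b_right    # right half of each source
    elif form == "lr":
        hi, lo = a_left, b_right     # left of the first, right of the second
    else:
        raise ValueError(f"unknown fmix form: {form}")
    return (hi << 32) | lo
```

In every form the half taken from the first source lands in the upper 32 bits of the result.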
Page 984
fmpy — Floating-point Multiply
Format: (qp) fmpy.pc.sf f1 = f3, f4    pseudo-op of: (qp) fma.pc.sf f1 = f3, f4, f0
Description: The product of FR f3 and FR f4 is computed to infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc....
Page 985
fms — Floating-point Multiply Subtract
Format: (qp) fms.pc.sf f1 = f3, f4, f2
Description: The product of FR f3 and FR f4 is computed to infinite precision and then FR f2 is subtracted from this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc....
Page 987
fneg — Floating-point Negate
Format: (qp) fneg f1 = f3    pseudo-op of: (qp) fmerge.ns f1 = f3, f3
Description: The value in FR f3 is negated and placed in FR f1. If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.
Operation: See “fmerge —...
Page 988
fnegabs — Floating-point Negate Absolute Value
Format: (qp) fnegabs f1 = f3    pseudo-op of: (qp) fmerge.ns f1 = f0, f3
Description: The absolute value of the value in FR f3 is computed, negated, and placed in FR f1. If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.
Operation: See “fmerge —...
Page 989
fnma — Floating-point Negative Multiply Add
Format: (qp) fnma.pc.sf f1 = f3, f4, f2
Description: The product of FR f3 and FR f4 is computed to infinite precision, negated, and then FR f2 is added to this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc....
Page 991
fnmpy — Floating-point Negative Multiply
Format: (qp) fnmpy.pc.sf f1 = f3, f4    pseudo-op of: (qp) fnma.pc.sf f1 = f3, f4, f0
Description: The product of FR f3 and FR f4 is computed to infinite precision and then negated. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc....
Page 992
fnorm — Floating-point Normalize
Format: (qp) fnorm.pc.sf f1 = f3    pseudo-op of: (qp) fma.pc.sf f1 = f3, f1, f0
Description: FR f3 is normalized and rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1. If FR f3 is a NaTVal, FR f1...
Page 993
for — Floating-point Logical Or
Format: (qp) for f1 = f2, f3
Description: The bit-wise logical OR of the significand fields of FR f2 and FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1...
Page 994
fpabs — Floating-point Parallel Absolute Value
Format: (qp) fpabs f1 = f3    pseudo-op of: (qp) fpmerge.s f1 = f0, f3
Description: The absolute values of the pair of single precision values in the significand field of FR f3 are computed and stored in the significand field of FR f1....
Page 995
fpack — Floating-point Pack
Format: (qp) fpack f1 = f2, f3    pack_form
Description: The register format numbers in FR f2 and FR f3 are converted to single precision memory format. These two single precision numbers are concatenated and stored in the significand field of FR f1....
Page 996
fpamax — Floating-point Parallel Absolute Maximum
Format: (qp) fpamax.sf f1 = f2, f3
Description: The paired single precision values in the significands of FR f2 and FR f3 are compared. The operands with the larger absolute value are returned in the significand field of FR f1. If the magnitude of high (low) FR f3 is less than the magnitude of high (low) FR f2, high...
Page 998
fpamin — Floating-point Parallel Absolute Minimum
Format: (qp) fpamin.sf f1 = f2, f3
Description: The paired single precision values in the significands of FR f2 and FR f3 are compared. The operands with the smaller absolute value are returned in the significand of FR f1. If the magnitude of high (low) FR f2 is less than the magnitude of high (low) FR f3, high...
Page 1000
fpcmp — Floating-point Parallel Compare
Format: (qp) fpcmp.frel.sf f1 = f2, f3
Description: The two pairs of single precision source operands in the significand fields of FR f2 and FR f3 are compared for one of twelve relations specified by frel. This produces a boolean result which is a mask of 32 1’s if the comparison condition is true, and a mask of 32 0’s otherwise....
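The mask-producing behaviour of the parallel compare can be sketched in Python. Each 64-bit significand is treated as two IEEE single precision values, and the relation is supplied as a Python callable, which stands in for the frel completer and is an assumption of this sketch. Status flags, faults and NaTVal handling are not modelled, and the function names are hypothetical.

```python
import struct

MASK32 = 0xFFFF_FFFF

def _single(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits & MASK32))[0]

def fpcmp(sig_a: int, sig_b: int, relation) -> int:
    """Compare the high and low single precision pairs and build the 64-bit mask."""
    result = 0
    for shift in (32, 0):                 # high pair first, then low pair
        a = _single(sig_a >> shift)
        b = _single(sig_b >> shift)
        if relation(a, b):
            result |= MASK32 << shift     # 32 ones where the relation holds
    return result
```

For instance, with the pairs (1.0, 2.0) and (2.0, 2.0) a less-than relation sets only the upper 32 bits of the result, which can then drive a bit-wise select between two packed operands.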
Page 1003
fpcvt.fx — Convert Parallel Floating-point to Integer
Format:
(qp) fpcvt.fx.sf f1 = f2           signed_form
(qp) fpcvt.fx.trunc.sf f1 = f2     signed_form, trunc_form
(qp) fpcvt.fxu.sf f1 = f2          unsigned_form
(qp) fpcvt.fxu.trunc.sf f1 = f2    unsigned_form, trunc_form
Description: The pair of single precision values in the significand field of FR f2 is converted to a pair of 32-bit signed integers (signed_form) or unsigned integers (unsigned_form) using...