Page 1 Intel® XScale™ Microarchitecture for the PXA255 Processor User’s Manual March, 2003 Order Number: 278796...
Page 2 Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel®...
Page 3: Table Of Contents
Contents
Introduction (p. 1-1)
About This Document (p. 1-1)
1.1.1 How to Read This Document (p. 1-1)
1.1.2 Other Relevant Documents (p. 1-1)
High-Level Overview of the Intel® XScale™ core as Implemented in the Application Processors (p. 1-2)
1.2.1 ARM* Compatibility (p. 1-3)
1.2.2 Features (p. 1-3)
1.2.2.1 Multiply/Accumulate (MAC) (p. 1-3)
1.2.2.2...
Page 4
Operation When Data Caching is Disabled (p. 6-4)
6.2.3 Cache Policies (p. 6-4)
6.2.3.1 Cacheability (p. 6-4)
6.2.3.2 Read Miss Policy (p. 6-4)
6.2.3.3 Write Miss Policy (p. 6-5)
6.2.3.4 Write-Back Versus Write-Through (p. 6-6)
6.2.4 Round-Robin Replacement Algorithm (p. 6-6)
6.2.5 Parity Protection (p. 6-6)
6.2.6 Atomic Accesses (p. 6-7)
Page 5
Data/Bus Request Buffer Full Mode (p. 8-6)
8.5.5 Stall/Writeback Statistics Mode (p. 8-7)
8.5.6 Instruction TLB Efficiency Mode (p. 8-8)
8.5.7 Data TLB Efficiency Mode (p. 8-8)
Multiple Performance Monitoring Run Statistics (p. 8-8)
Examples (p. 8-8)
Test (p. 9-1)
Boundary-Scan Architecture and Overview (p. 9-1)
Reset (p. 9-3)
Page 6
10.5.2 Data Breakpoints (p. 10-9)
10.6 Software Breakpoints (p. 10-11)
10.7 Transmit/Receive Control Register (TXRXCTRL) (p. 10-11)
10.7.1 RX Register Ready Bit (RR) (p. 10-12)
10.7.2 Overflow Flag (OV) (p. 10-13)
10.7.3 Download Flag (D) (p. 10-13)
10.7.4 TX Register Ready Bit (TR) (p. 10-14)
Page 7
10.14.1.2 Placing the Handler in Memory (p. 10-41)
10.14.2 Implementing a Debug Handler (p. 10-42)
10.14.2.1 Debug Handler Entry (p. 10-42)
10.14.2.2 Debug Handler Restrictions (p. 10-42)
10.14.2.3 Dynamic Debug Handler (p. 10-43)
10.14.2.4 High-Speed Download (p. 10-44)
10.14.3 Ending a Debug Session (p. 10-45)
Page 8
Intel® XScale™ Core Pipeline (p. A-1)
A.2.1 General Pipeline Characteristics (p. A-2)
A.2.1.1. Number of Pipeline Stages (p. A-2)
A.2.1.2. Intel® XScale™ Core Pipeline Organization (p. A-2)
A.2.1.3. Out Of Order Completion (p. A-3)
A.2.1.4. Register Dependencies (p. A-3)
A.2.1.5. Use of Bypassing (p. A-3)
A.2.2...
Page 9
Use of PLD Instructions (p. A-32)
A.6.4 Thumb Instructions (p. A-32)
Figures
1-1 Intel® XScale™ Microarchitecture Architecture Features (p. 1-3)
3-1 Example of Locked Entries in TLB (p. 3-8)
4-1 Instruction Cache Organization (p. 4-1)
4-2 Locked Line Effect on Round Robin Replacement (p. 4-6)...
Page 10
10-11 Code Download During a Cold Reset For Debug (p. 10-35)
10-12 Code Download During a Warm Reset For Debug (p. 10-37)
10-13 Downloading Code in IC During Program Execution (p. 10-38)
Intel® XScale™ Core RISC Superpipeline (p. A-2)
Tables
2-1 Multiply with Internal Accumulate Format (p. 2-4)
2-2 MIA{<cond>} acc0, Rm, Rs (p. 2-4)
2-3 MIAPH{<cond>} acc0, Rm, Rs (p. 2-5)...
Page 11
10-16 CP 14 Trace Buffer Register Summary (p. 10-24)
10-17 Checkpoint Register (CHKPTx) (p. 10-24)
10-18 TBREG Format (p. 10-25)
10-19 Message Byte Formats (p. 10-28)
10-20 LDIC Cache Functions (p. 10-32)
11-1 Branch Latency Penalty (p. 11-1)
11-2 Latency Example (p. 11-3)
11-3 Branch Instruction Timings (Those predicted by the BTB) (p. 11-3)
Page 12
11-12 Load and Store Multiple Instruction Timings (p. 11-8)
11-13 Semaphore Instruction Timings (p. 11-8)
11-14 CP15 Register Access Instruction Timings (p. 11-8)
11-15 CP14 Register Access Instruction Timings (p. 11-8)
11-16 SWI Instruction Timings (p. 11-8)
11-17 Count Leading Zeros Instruction Timings (p. 11-9)
Pipelines and Pipe stages (p. A-3)
Page 13: Introduction
Intel retains the right to make changes to these specifications at any time, without notice. In particular, descriptions of features, timings, and pin-outs do not imply a commitment to implement them.
Page 14: High-Level Overview Of The Intel® Xscale™ Core As Implemented In The Application Processors
This document describes only the Intel® XScale™ core as it is implemented in the PXA255 processor. In almost every attribute the Intel® XScale™ core used in the application processor is identical to the Intel® XScale™ core implemented in the Intel®...
Page 15: Features
The Intel® XScale™ core provides the ARM* V5T Thumb instruction set and the ARM* V5E DSP extensions. To further enhance multimedia applications, the Intel® XScale™ core includes additional Multiply-Accumulate functionality as the first instantiation of Intel® Media Processing Technology. These new operations from Intel are mapped into ARM* coprocessor space.
Page 16: Instruction Cache
1.2.2.4 Branch Target Buffer The Intel® XScale™ core provides a Branch Target Buffer (BTB) to predict the outcome of branch type instructions. It provides storage for the target address of branch type instructions and predicts the next address to present to the instruction cache when the current instruction address is that of a branch.
Page 17: Performance Monitoring
Access Port (TAP) Controller implementation, which is based on IEEE 1149.1 (JTAG) Standard Test Access Port and Boundary-Scan Architecture. The purpose of the TAP controller is to support test logic internal and external to the Intel® XScale™ core such as built-in self-test and boundary-scan.
Page 18: Terminology And Acronyms
Software should not modify reserved fields or depend on any values in reserved fields.
Translation Look-aside Buffer, a cache of Page Table descriptors loaded from memory to minimize page-table walking overhead.
Page 19: Programming Model
2.2.1 Big Endian versus Little Endian
The Intel® XScale™ core supports both big and little endian data representation. The B-bit of the Control Register (Coprocessor 15, register 1, bit 7) selects between big and little endian mode. The default behavior of the application processor at reset is little endian. To run in big endian mode, the B bit must be set before attempting any sub-word accesses to memory.
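As an illustration of the B bit described above, the following Python sketch models the Control Register as a plain 32-bit integer. The helper names (`set_big_endian`, `is_big_endian`, `word_bytes`) are hypothetical and exist only for this example; on hardware the register is read and written with MRC/MCR, not Python.

```python
# Bit position of the B (endianness) bit in the CP15 Control Register,
# as stated in the text: Coprocessor 15, register 1, bit 7.
B_BIT = 7


def set_big_endian(control: int) -> int:
    """Return a copy of the Control Register value with the B bit set."""
    return (control | (1 << B_BIT)) & 0xFFFFFFFF


def is_big_endian(control: int) -> bool:
    """True when the B bit is set in the given Control Register value."""
    return bool(control & (1 << B_BIT))


def word_bytes(value: int, big_endian: bool) -> bytes:
    """Memory byte order (lowest address first) of a 32-bit word."""
    return value.to_bytes(4, "big" if big_endian else "little")
```

The `word_bytes` helper shows why the bit matters for sub-word accesses: the byte found at the lowest address of a stored word differs between the two modes.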
Page 20: Arm* Dsp-Enhanced Instruction Set
Base Restored Abort Model.
Extensions to ARM* Architecture
The Intel® XScale™ core made a few extensions to the ARM* Version 5 architecture to meet the needs of various markets and design requirements. The following is a list of the extensions which are discussed in the next sections.
Page 21: Dsp Coprocessor 0 (Cp0)
2.3.1 DSP Coprocessor 0 (CP0)
The Intel® XScale™ core adds a DSP coprocessor to the architecture to increase the performance and precision of audio processing algorithms. This coprocessor contains a 40-bit accumulator and 8 new instructions.
Page 22: Multiply With Internal Accumulate Format
Two new fields were created for this format, acc and opcode_3. The acc field specifies 1 of 8 internal accumulators to operate on and opcode_3 defines the operation for this format. The Intel® XScale™ core defines a single 40-bit accumulator referred to as acc0; future implementations may define multiple internal accumulators.
Page 23: Miaph{} Acc0, Rm, Rs
11-5. Specifying R15 for register Rs or Rm has unpredictable results. acc0 is defined to be 0b000 on the Intel® XScale™ core. The MIAPH instruction performs two 16-bit signed multiplies on packed halfword data and accumulates these to a single 40-bit accumulator. The first signed multiplication is performed on the lower 16 bits of the value in register Rs with the lower 16 bits of the value in register Rm.
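The arithmetic of MIAPH can be sketched in Python as a behavioural model. Two points are assumptions that the excerpt above does not state: that the second multiplication uses the upper 16 bits of Rs and Rm, and that the accumulation wraps modulo 2^40 without saturating. The function names are hypothetical.

```python
ACC_BITS = 40
ACC_MASK = (1 << ACC_BITS) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def miaph(acc: int, rm: int, rs: int) -> int:
    """Model of MIAPH acc0, Rm, Rs on 32-bit register values.

    First product: lower halfwords of Rs and Rm (as described in the text).
    Second product: upper halfwords (assumption, not stated in the excerpt).
    The sum wraps to 40 bits (assumption: no saturation).
    """
    low = sign_extend(rm, 16) * sign_extend(rs, 16)
    high = sign_extend(rm >> 16, 16) * sign_extend(rs >> 16, 16)
    return (acc + low + high) & ACC_MASK
```

For instance, with Rm = 0x00020003 and Rs = 0x00040005 the model adds 3*5 + 2*4 = 23 to the accumulator.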
Page 24: Internal Accumulator Access Format
The acc field specifies 1 of 8 internal accumulators to transfer data to/from. The Intel® XScale™ core implements a single 40-bit accumulator referred to as acc0; future implementations can specify multiple internal accumulators of varying sizes.
Page 25: Internal Accumulator Access Format
Section 7.2.13, “Register 15: Coprocessor Access Register” on page 7-14 for more details). The Intel® XScale™ core implements two instructions, MAR and MRA, that move two ARM* registers to acc0 and move acc0 to two ARM* registers, respectively. Table 2-5. Internal Accumulator Access Format...
Page 26: Mar{} Acc0, Rdlo, Rdhi
RdLo. Bits[39:32] of the value in acc0 are sign extended to 32 bits and moved into the register RdHi. The instruction is only executed if the condition specified in the instruction matches the condition code status. This instruction executes in any processor mode.
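The register split performed by MRA can be modelled as below. This is a sketch, not hardware-accurate code: it assumes the low 32 bits of acc0 go to RdLo (the excerpt is truncated before that statement), and the inverse `mar` helper assumes MAR takes accumulator bits [39:32] from the low byte of RdHi. Both function names are hypothetical.

```python
def mra(acc: int):
    """Model of MRA: split a 40-bit accumulator into (RdLo, RdHi).

    RdLo receives acc[31:0] (assumption); RdHi receives acc[39:32]
    sign extended to 32 bits, as described in the text.
    """
    acc &= (1 << 40) - 1
    rd_lo = acc & 0xFFFFFFFF
    top = (acc >> 32) & 0xFF
    rd_hi = (top | 0xFFFFFF00) if top & 0x80 else top
    return rd_lo, rd_hi


def mar(rd_lo: int, rd_hi: int) -> int:
    """Model of MAR (assumption: acc[39:32] comes from RdHi[7:0])."""
    return ((rd_hi & 0xFF) << 32) | (rd_lo & 0xFFFFFFFF)
```

Under these assumptions, `mar(*mra(x))` returns `x` for any 40-bit value, which is the property a context-switch routine that saves and restores acc0 relies on.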
Page 27: New
Fine Page Table” on page 2-10. Two second-level descriptor formats have been defined for the Intel® XScale™ core: one is used for the coarse page table and the other is used for the fine page table. AP bits are ARM* Access Permission controls.
Page 28: Additions To Cp15 Functionality
Tiny Page Base Address C B 1 1
The TEX (Type Extension) field is present in several of the descriptor types. In the Intel® XScale™ core, only the LSB of this field is used; this is called the X bit.
Page 29: Event Architecture
Exception Summary
Table 2-11 shows all the exceptions that the Intel® XScale™ core may generate, and the attributes of each. Subsequent sections give details on each exception. A precise exception is defined as one where R14_mode always contains a pointer to locate the instruction that caused the exception.
Page 30: Prefetch Aborts
2.3.4.4 Data Aborts Two types of data aborts exist in the Intel® XScale™ core: precise and imprecise. A precise data abort is defined as one where R14_ABORT always contains the PC (+8) of the instruction that caused the exception. An imprecise abort is one where R14_ABORT contains the PC (+4) of the next instruction to execute and not the address of the instruction that caused the abort.
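The two R14_ABORT conventions above translate into different address arithmetic in an abort handler. The following Python sketch only illustrates that arithmetic; the function names are hypothetical.

```python
def precise_faulting_address(r14_abort: int) -> int:
    """Precise abort: R14_ABORT holds the faulting instruction's PC + 8."""
    return (r14_abort - 8) & 0xFFFFFFFF


def imprecise_resume_address(r14_abort: int) -> int:
    """Imprecise abort: R14_ABORT holds the next instruction's PC + 4.

    The result is where execution would resume, not the address of the
    instruction that caused the abort, which cannot be recovered from R14.
    """
    return (r14_abort - 4) & 0xFFFFFFFF
```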
Page 31: Intel® Xscale™ Core Encoding Of Fault Status For Data Aborts
+ 4, which is the same for both ARM* and Thumb mode. Although the Intel® XScale™ core guarantees the Base Restored Abort Model for precise aborts, it cannot do so in the case of imprecise aborts. A Data Abort handler may encounter an updated base register if it is invoked because of an imprecise abort.
Page 32: Events From Preload Instructions
When execution reaches the end of the list, the PLD on address 0x0 will not cause a fault. Rather, it will be ignored and the loop will terminate normally.
Page 33: Debug Events
MOVS R0, R1     ; Advance to next node. At end of list?
BNE sumList     ; If not then loop
2.3.4.6 Debug Events
Debug events are covered in Section 10.4, “Debug Exceptions” on page 10-5.
Page 35: Memory Management
TLB along with the access rights and attributes of the page or section. These translations can also be locked down in either TLB to guarantee the performance of critical routines. The Intel® XScale™ core allows system software to associate various attributes with regions of memory: •...
Page 36: Version 4 Vs. Version 5
These attributes are ignored when the MMU is disabled. To allow compatibility with older system software, the new Intel® XScale™ core attributes take advantage of encoding space in the descriptors that were formerly reserved and defaulted to zero.
Page 37: Details On Data Cache And Write Buffer Behavior
Thus software may issue a fence to impose a partial ordering on memory accesses. Table 3-3 on page 3-4 shows the circumstances in which memops act as fences.
Page 38: Exceptions
An individual entry in the data or instruction TLB can also be invalidated. See Table 7-13, “TLB Functions” on page 7-11 for a listing of commands supported by the Intel® XScale™ core.
Page 39: Enabling/Disabling
Locking entries into either the instruction TLB or data TLB reduces the available number of entries (by the number that was locked down) for hardware to cache other virtual to physical address translations. A procedure for locking entries into the instruction TLB is shown in Example 3-2 on page 3-6.
Page 40 Software should disable interrupts (FIQ or IRQ) in this case. As a general rule, software should avoid locking in anything other than Supervisor mode. The proper procedure for locking entries into the data TLB is shown in Example 3-3 on page 3-7.
Page 41: Round-Robin Replacement Algorithm
Only entries 0 through 30 can be locked in either TLB; entry 31 can never be locked. If the lock pointer is at entry 31, a lock operation will update the TLB entry with the translation and ignore the lock. In this case, the round-robin pointer will stay at entry 31.
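The locking rule above can be made concrete with a small Python model of a 32-entry TLB. It is a sketch under stated assumptions: the lock pointer is assumed to start at entry 0, the round-robin pointer is assumed to start at entry 31, and replacement is assumed to cycle through the unlocked entries from the lock pointer up to entry 31. The class and method names are hypothetical.

```python
NUM_ENTRIES = 32
LAST = NUM_ENTRIES - 1  # entry 31 can never be locked


class TlbModel:
    def __init__(self):
        self.entries = [None] * NUM_ENTRIES
        self.locked = [False] * NUM_ENTRIES
        self.lock_ptr = 0   # assumption: locking starts at entry 0
        self.rr_ptr = LAST  # assumption: round-robin pointer starts at 31

    def lock(self, translation) -> bool:
        """Write a translation at the lock pointer; return True if it was locked."""
        index = self.lock_ptr
        self.entries[index] = translation
        if index < LAST:
            self.locked[index] = True
            self.lock_ptr += 1
            return True
        # Lock pointer at entry 31: entry is updated, the lock is ignored.
        return False

    def replace(self, translation) -> int:
        """Round-robin replacement among unlocked entries; return the index used."""
        nxt = self.rr_ptr + 1
        if nxt > LAST or nxt < self.lock_ptr:
            nxt = self.lock_ptr
        self.rr_ptr = nxt
        self.entries[nxt] = translation
        return nxt
```

With eight entries locked, the model replaces only entries 8 through 31, which matches the 24 available entries in the manual's Figure 3-1. Once 31 entries are locked, every replacement lands on entry 31.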
Page 42: Example Of Locked Entries In Tlb
Figure 3-1. Example of Locked Entries in TLB: eight entries locked (entries 0 through 7), 24 entries available for round robin replacement (entries 8 through 31).
Page 43: Instruction Cache
The Intel® XScale™ core instruction cache enhances performance by reducing the number of instruction fetches from external memory. The cache provides fast execution of cached code. Code can also be locked down when guaranteed or fast access time is required. An additional 2 Kbyte mini instruction cache is used exclusively during debugging; see Section 10.13.6...
Page 44: Operation
Each external fetch request uses a fetch buffer that holds 32 bytes and eight valid bits, one for each word. A miss causes the following:
1. A fetch buffer is allocated.
Page 45: Round-Robin Replacement Algorithm
1 parity bit. The instruction cache tag is not parity protected. When a parity error is detected on an instruction cache access, a prefetch abort exception occurs if the Intel® XScale™ core attempts to execute the instruction. Before servicing the exception, hardware places a notification of the error in the Fault Status Register (Coprocessor 15, register 5).
Page 46: Instruction Fetch Latency
Instruction Fetch Latency The instruction fetch latency is dependent on the core to memory frequency ratio, system bus bandwidth, system memory, etc. The outstanding external memory bus activity on the PXA255 processor will have the highest impact on instruction fetch latency.
Page 47: Instruction Cache Control
; The instruction cache is guaranteed to be invalidated at this point; the next
; instruction sees the result of the invalidate command.
The Intel® XScale™ core also supports invalidating an individual line from the instruction cache. See Table 7-12, “Cache Functions” on page 7-9 for the exact command.
Page 48: Locking Instructions In The Instruction Cache
[Figure: Locked Line Effect on Round Robin Replacement. Set 2: 28 ways locked, only ways 28-31 available for replacement. Set 31: all 32 ways available for round robin replacement.]
Page 49: Unlocking Instructions In The Instruction Cache
4.3.5 Unlocking Instructions in the Instruction Cache The Intel® XScale™ core provides a global unlock command for the instruction cache. There is no unlock function for individual lines in the cache. Writing to coprocessor 15, register 9 unlocks all the locked lines in the instruction cache and leaves them valid.
Page 51: Branch Target Buffer
The Intel® XScale™ core uses dynamic branch prediction to reduce the penalties associated with changing the flow of program execution. The Intel® XScale™ core features a branch target buffer that provides the instruction cache with the target address of branch type instructions. The branch target buffer is implemented as a 128-entry, direct mapped cache.
Page 52: Reset
Once a branch is stored in the BTB, the history bits are updated upon every execution of the branch as shown in Figure 5-2.
BTB Control
5.2.1 Disabling/Enabling
The BTB is always disabled with Reset. Software enables the BTB through the Control Register bit[11] in coprocessor 15 (see Section 7.2.2).
Page 53: Invalidation
Section 7.2.7, “Register 7: Cache Functions” on page 7-9.
3. The BTB is invalidated when the Process ID Register is written.
4. The BTB is invalidated when the instruction cache is invalidated via CP15, register 7 functions.
Page 55: Data Cache
The Intel® XScale™ core data cache enhances performance by reducing the number of data accesses to and from external memory. There are two data cache structures in the Intel® XScale™ core, a 32 Kbyte data cache and a 2 Kbyte mini-data cache. An eight entry write buffer and a four entry fill buffer are also implemented to decouple the Intel®...
Page 56: Mini-Data Cache Overview
The mini-data cache is virtually addressed and virtually tagged and supports the same caching policies as the data cache. However, lines cannot be locked into the mini-data cache.
Page 57: Write Buffer And Fill Buffer Overview
6.1.3 Write Buffer and Fill Buffer Overview The Intel® XScale™ core employs an eight entry write buffer, each entry containing 16 bytes. Stores to external memory are first placed in the write buffer and subsequently taken out when the bus is available.
Page 58: Data Cache And Mini-Data Cache Operation
If so, the current request is placed in the pending buffer and waits until the previously requested fill completes, after which it accesses the cache again, obtains the requested data, and returns it to the destination register.
Page 59: Write Miss Policy
For the PXA255 processor, the size of a data load also depends on the memory bank addressed in the access. For example, all 32-bit wide SDRAM reads are bursts of 4 words. All loads from this SDRAM generate a read of 4 words, even though for uncacheable loads only the data the core requested is used.
Page 60: Write-Back Versus Write-Through
6.2.3.4 Write-Back Versus Write-Through The Intel® XScale™ core supports write-back caching or write-through caching, controlled through the MMU page attributes. When write-through caching is specified, all store operations are written to external memory even if the access hits the cache. This feature keeps the external memory coherent with the cache, i.e., no dirty bits are set for this region of memory in the data/...
Page 61: Atomic Accesses
MRC p15, 0, r0, c1, c0, 0   ; Get current control register
ORR r0, r0, #4              ; Enable D-Cache by setting ‘C’ (bit 2)
MCR p15, 0, r0, c1, c0, 0   ; And update the Control register
Page 62: Invalidate & Clean Operations
This allocation evicts any dirty cache data back to external memory. Example 6-2 on page 6-9 shows how the data cache can be cleaned.
Page 63 It must reside in a page that is marked as mini Data Cache cacheable (see Section 2.3.2). The time it takes to execute a global clean operation depends on the number of dirty lines in cache.
Page 64: Re-Configuring The Data Cache As Data Ram
The data cache can only be unlocked by using the global unlock command. See Table 7-14, “Cache Lockdown Functions” on page 7-11. The invalidate-entry command should not be issued to a locked line as this will render the line useless until a global unlock is issued.
Page 65 ; in R1 to the next cache line. DRAIN SUBS R0, R0, #1; Decrement loop count BNE LOOP1 ; Turn off data cache locking DRAIN R2, #0x0 P15,0,R2,C9,C2,0 ; Take the data cache out of lock mode. CPWAIT Intel® XScale™ Microarchitecture User’s Manual 6-11...
Page 66 For this reason, system software should ensure the memory address used in the PLD is correct. If this cannot be ascertained, replace the PLD with an LDR instruction that targets a scratch register.
Page 67: Write Buffer/Fill Buffer Operation And Control
Before locking, the programmer must ensure that no part of the target data range is already resident in the cache. The Intel® XScale™ core will not refetch such data, which will result in it not being locked into the cache. If there is any doubt as to the location of the targeted memory data, the cache should be cleaned and invalidated to prevent this scenario.
Page 68 The write buffer and fill buffer support a drain operation, such that before the next instruction executes, all the Intel® XScale™ core data requests to external memory have completed. See Table 7-12, “Cache Functions” on page 7-9 for the exact command.
Page 69: Configuration
7-2. Any access to CP14 in user mode will cause an Undefined Instruction exception. Coprocessors CP15 and CP14 on the Intel® XScale™ core do not support access via CDP, MRRC, or MCRR instructions. An attempt to access these coprocessors with these instructions will result in an Undefined Instruction exception.
Page 70: Mrc/Mcr Format
0 = MCR
1 = MRC
19:16  CRn - specifies which coprocessor register
15:12  Rd - General Purpose Register, R0..R15
11:8   cp_num - coprocessor number. The Intel® XScale™ core defines three coprocessors:
       0b1111 = CP15
       0b1110 = CP14
       0b0000 = CP0...
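To show how these fields combine into an instruction word, here is a Python sketch of an MRC/MCR encoder. Only the CRn, Rd, and cp_num positions and the MCR/MRC selector come from the table above. The remaining positions (condition in bits 31:28, 0b1110 in bits 27:24, opcode_1 in bits 23:21, the direction bit at bit 20, opcode_2 in bits 7:5, bit 4 set, CRm in bits 3:0) are taken from the general ARM coprocessor register-transfer encoding and are assumptions relative to this excerpt. The function name is hypothetical.

```python
def encode_mcr_mrc(is_mrc: bool, cp_num: int, opcode_1: int, rd: int,
                   crn: int, crm: int, opcode_2: int, cond: int = 0xE) -> int:
    """Assemble a 32-bit MRC (is_mrc=True) or MCR (is_mrc=False) word."""
    return (
        ((cond & 0xF) << 28)        # condition field (0xE = always)
        | (0b1110 << 24)            # coprocessor register transfer class
        | ((opcode_1 & 0x7) << 21)
        | ((1 if is_mrc else 0) << 20)  # 0 = MCR, 1 = MRC
        | ((crn & 0xF) << 16)       # bits 19:16, coprocessor register
        | ((rd & 0xF) << 12)        # bits 15:12, general purpose register
        | ((cp_num & 0xF) << 8)     # bits 11:8, coprocessor number
        | ((opcode_2 & 0x7) << 5)
        | (1 << 4)
        | (crm & 0xF)
    )
```

For example, `MRC p15, 0, r0, c1, c0, 0` (read the Control Register into r0) assembles to 0xEE110F10 under this layout, and the matching MCR differs only in bit 20.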
Page 71: Cp15 Registers
- coprocessor number
  0b1110 = CP14
  CP0-13 & CP15 = Undefined Exception
8-bit word offset
CP15 Registers
Table 7-3 lists the CP15 registers implemented in the Intel® XScale™ core.
Table 7-3. CP15 Registers
Register (CRn) | Opcode_2 | Access | Description
Read / Write-Ignored...
Page 72: Register 0: Id & Cache Type Registers
The Cache Type Register is selected when opcode_2=1 and describes the cache configuration of the Intel® XScale™ core. These values are device specific to the PXA255 processor; for the full set of potential values consult the ARM* Architecture Reference Manual.
Page 73: Register 1: Control & Auxiliary Control Registers
Register 1 is made up of two registers: one that is compliant with ARM* Version 5 and is selected by opcode_2 = 0x0, and another that is specific to the Intel® XScale™ core and is selected by opcode_2 = 0x1. The latter is known as the Auxiliary Control Register.
Page 74: Arm* Control Register
The configuration of the mini-data cache must be set up before any data access is made that may be cached in the mini-data cache. Once data is cached, software must ensure that the mini-data cache has been cleaned and invalidated before the mini-data cache attributes can be changed.
Page 75: Register 2: Translation Table Base Register
Translation Table Base
reset value: unpredictable
Bits   | Access                             | Description
31:14  | Read / Write                       | Translation Table Base - Physical address of the base of the first-level descriptor table
13:0   | Read-unpredictable / Write-as-Zero | Reserved
Page 76: Register 3: Domain Access Control Register
Read / Write | accessed when a data abort occurred
Read / Write | Status - Used along with the X-bit above to determine the type of cycle that generated the exception. See “Event Architecture” on page 2-11
Page 77: Register 6: Fault Address Register
The Drain Write Buffer function not only drains the write buffer but also drains the fill buffer. The Intel® XScale™ core does not check permissions on addresses supplied for cache or TLB functions. Because only privileged software may execute these functions, full accessibility is assumed.
Page 78: Register 8: Tlb Operations
To invalidate the TLBs the commands below are required. All operations defined in Table 7-13 work regardless of whether the cache is enabled or disabled. This register is write-only. Reads from this register, as with an MRC, have an undefined effect.
Page 79: Register 9: Cache Lock Down
31:1 | Read-unpredictable / Write-as-Zero | Reserved
0    | Read-unpredictable / Write         | Data Cache Lock Mode (L): 0 = No locking occurs; 1 = Any fill into the data cache while this bit is set gets locked in
Page 80: Register 10: Tlb Lock Down
7.2.11 Register 13: Process ID The Intel® XScale™ core supports the remapping of virtual addresses through a Process ID (PID) register. This remapping occurs before the instruction cache, instruction TLB, data cache and data TLB are accessed. The PID register controls when virtual addresses are remapped and to what value.
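A minimal Python sketch of the remapping follows. The exact rule is not stated in the excerpt above; the sketch assumes that the PID occupies bits 31:25 of the PID register and is ORed into a virtual address only when bits 31:25 of that address are zero. Treat both the rule and the function name `remap_virtual_address` as assumptions of this example.

```python
PID_FIELD_MASK = 0xFE000000  # bits 31:25 (assumed PID field position)


def remap_virtual_address(va: int, pid_register: int) -> int:
    """Apply the assumed PID remapping to a 32-bit virtual address."""
    va &= 0xFFFFFFFF
    if (va & PID_FIELD_MASK) == 0:
        # Address lies in the lowest 32 Mbyte: substitute the PID bits.
        return va | (pid_register & PID_FIELD_MASK)
    # Any other address passes through unchanged.
    return va
```

Under this assumption, a PID register value of zero leaves every address unchanged, so remapping is effectively off until software writes a non-zero PID.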
Page 81: The Pid Register Affect On Addresses
IBCR1), one data breakpoint address register (DBR0), one configurable data mask/address register (DBR1), and one data breakpoint control register (DBCON). The Intel® XScale™ core also supports a 2 Kbyte mini instruction cache for debugging and a 256-entry trace buffer that records program execution information.
Page 82: Register 15: Coprocessor Access Register
reset value: 0x0000_0000
Bits   | Access                             | Description
31:16  | Read-unpredictable / Write-as-Zero | Reserved - Should be programmed to zero for future compatibility
Page 83: Cp14 Registers
OS has to maintain a list of what processes are modifying CP0 and their associated state. A system programmer making this OS change should include code for coprocessors CP0 through CP13. Although the PXA255 processor only supports CP0, future products may implement additional coprocessor functionality from CP1-CP13.
Page 84: Registers 0-3: Performance Monitoring
7-25. To enter any of these modes, write the appropriate data to CP14, register 7 (PWRMODE). Software may read this register, but since software only runs during ACTIVE mode, it will always read zeroes from the M field.
Page 85: Registers 8-15: Software Debug
10 through 13 support a 256-entry trace buffer. Registers 14 and 15 are the debug link register and debug SPSR (saved program status register). These registers are explained in more detail in Chapter 10, “Software Debug”. Opcode_2 and CRm must be zero.
Page 86: Accessing The Debug Registers
0b1101 | MCR p14, 0, Rd, c13, c0, 0
Read Transmit and Receive Debug Control Register (TXRXCTRL) | 0b1110 | MRC p14, 0, Rd, c14, c0, 0
Write TXRXCTRL | 0b1110 | MCR p14, 0, Rd, c14, c0, 0
Page 87: Performance Monitoring
This chapter describes the performance monitoring facility of the Intel® XScale™ core. The events that are monitored provide performance information for compiler writers, system application developers and software programmers.
Overview
The Intel® XScale™ core hardware provides two 32-bit performance counters that allow two unique events to be monitored simultaneously.
Page 88: Performance Count Registers (Pmn0 - Pmn1; Cp14 - Register 2 And 3, Respectively)
2 cycles it takes to generate an overflow interrupt.
Performance Monitor Control Register (PMNC)
The performance monitor control register (PMNC) is a coprocessor register that:
• controls which events PMN0 and PMN1 will monitor
Page 89: Performance Monitor Control Register (Cp14, Register 0)
Bit 4 = performance counter 0 interrupt enable
  0 = disable interrupt
  1 = enable interrupt
Clock Counter Divider (D) - Read / Write
  0 = CCNT counts every processor clock cycle
  1 = CCNT counts every 64th processor clock cycle
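The effect of the Clock Counter Divider on measurements can be sketched as follows. The sketch assumes CCNT is a 32-bit counter, and the 400 MHz clock in the usage note is a hypothetical figure chosen only for illustration; the function names are likewise hypothetical.

```python
CCNT_WIDTH_BITS = 32  # assumption: CCNT is a 32-bit counter


def ccnt_to_cycles(ccnt: int, divider_enabled: bool) -> int:
    """Convert a CCNT reading to processor clock cycles.

    With D = 1 each count stands for 64 processor clock cycles.
    """
    return ccnt * (64 if divider_enabled else 1)


def seconds_until_overflow(core_hz: float, divider_enabled: bool) -> float:
    """Time for CCNT to wrap once at the given core clock frequency."""
    return ccnt_to_cycles(2 ** CCNT_WIDTH_BITS, divider_enabled) / core_hz
```

At a hypothetical 400 MHz core clock the counter wraps after about 10.7 seconds with D = 0 and about 687 seconds with D = 1, which is the trade-off between resolution and measurement window that the divider offers.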
Page 90: Managing The Pmnc
PMNC register. The interrupt will remain asserted until software clears the overflow flag by writing a one to the flag that is set. Note that the PXA255 processor Interrupt Controller and the CPSR interrupt bit must be enabled in order for software to receive the interrupt.
Page 91: Instruction Cache Efficiency Mode
PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time due to the instruction fetch buffers, even when the memory page is marked as uncached.
Page 92: Data Cache Efficiency Mode
The average number of cycles the processor stalled waiting for an instruction fetch from external memory to return. This is calculated by dividing PMN0 by PMN1. If the average is high then the Intel® XScale™ core may be starved of memory access due to other bus traffic. •...
Page 93: Stall/Writeback Statistics Mode
is high, possibly due to starvation, these Data Cache buffers will become full. This performance monitoring mode is provided to see if the Intel® XScale™ core is being starved of the bus external to the Intel® XScale™ core.
Page 94: Instruction Tlb Efficiency Mode
In this example, the events selected with the Instruction Cache Efficiency mode are monitored and CCNT is used to measure total execution time. Sampling time ends when PMN0 overflows, which generates an IRQ interrupt.
Page 95
Instruction Cache miss-rate = 100 * PMN1/PMN0 = 5%
CPI = (CCNT + 2^32)/Number of instructions executed = 2.4 cycles/instruction
In the contrived example above, the instruction cache had a miss-rate of 5% and CPI was 2.4.
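The two formulas above can be written as small Python helpers. The counter values used to exercise them are hypothetical and are not the manual's sample data; the `overflows` parameter generalises the single 2^32 term in the CPI formula to any number of CCNT wraps, which is an extension made for this sketch.

```python
def icache_miss_rate_percent(pmn0: int, pmn1: int) -> float:
    """Instruction cache miss rate = 100 * PMN1 / PMN0."""
    return 100.0 * pmn1 / pmn0


def cycles_per_instruction(ccnt: int, overflows: int, instructions: int) -> float:
    """CPI = (CCNT + overflows * 2^32) / instructions executed.

    overflows = 1 reproduces the (CCNT + 2^32) form used in the text.
    """
    return (ccnt + overflows * 2 ** 32) / instructions
```

For example, 1,000 counted instructions with 50 misses give a 5% miss rate.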
Page 97: Test
JTAG, an acronym for the Joint Test Action Group. The JTAG interface on the application processor can be used as a hardware interface for software debugging of PXA255 systems. This interface is described in Chapter 10, “Software Debug.”...
Page 98: Reset
Idcode instruction is selected. If TCK is pulsed, the contents of the ID register are clocked out of TDO. If the boundary-scan interface is not to be used, then the nTRST pin may be tied permanently low or to the nRESET pin.
Page 99: Instruction Register
Code             Instruction
00100            clamp
00101            private
00110            not used
00111            ldic
01000            highz
01001            dcsr
01101            private
01110 - 01111    not used
10000            dbgtx
10001 - 11001    private
11010 - 11101    not used
11110            idcode
11111            bypass
Page 100: Jtag Instruction Descriptions
CAPTURE_DR state. While this instruction is in effect, all other test data registers have no effect on the operation of the system. Test data registers with both test and system functionality perform their system functions when this instruction is selected. (Table cells for this entry: code 11111; IEEE 1149.1 Required.)
Page 101: Test Data Registers
This is to prevent a scan operation from disabling power to the device and/or resetting external components. The following pins are not part of the boundary-scan shift-register:
• PEXTAL
• PXTAL
• TEXTAL
Page 102 JTAG reset (from forcing nTRST low or entering the Test Logic Reset state). The PXA255 256-pin PBGA package boundary scan pin order is shown in Figure 9-2 on page 9-6.
Page 103: Device Identification (Id) Code Register
The high-order 4 bits of the ID register contain the version number of the silicon and change with each new revision. There is no parallel output from the ID register. The 32-bit device identification code is loaded into the ID register from its parallel inputs during the CAPTURE-DR state.
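A debugger that has shifted the 32-bit ID code out of TDO could split it into fields along these lines. This is a sketch only: the text above describes just the 4-bit version field, the remaining fields follow the standard IEEE 1149.1 device identification layout, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of a 32-bit JTAG device identification code.
 * Only the version field is described in the text; the other fields
 * follow the generic IEEE 1149.1 layout (an assumption here). */
struct jtag_idcode {
    uint8_t  version;       /* bits 31:28, silicon revision */
    uint16_t part_number;   /* bits 27:12 */
    uint16_t manufacturer;  /* bits 11:1, manufacturer identity */
    uint8_t  marker;        /* bit 0, 1 for a valid ID code */
};

/* Split a raw ID code value into its fields. */
struct jtag_idcode decode_idcode(uint32_t raw)
{
    struct jtag_idcode id;
    id.version      = (uint8_t)(raw >> 28);
    id.part_number  = (uint16_t)((raw >> 12) & 0xFFFFu);
    id.manufacturer = (uint16_t)((raw >> 1) & 0x7FFu);
    id.marker       = (uint8_t)(raw & 1u);
    return id;
}
```

For a made-up value such as 0x3ABCD013, the version field is 3 and the part number field is 0xABCD; real values must be taken from the device documentation.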
Page 104: Data Specific Registers
This prevents a scan operation from turning off power to the application processor. For greater detail on the state machine and the public instructions, refer to IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture Document.
Page 105: Test Logic Reset State
The TAP controller enters the Run-Test/Idle state between scan operations. The controller remains in this state as long as TMS is held low. In the Run-Test/Idle state the runbist instruction is performed; the result is reported in the RUNBIST register. Instructions that do not call functions ...
Page 106: Select-Dr-Scan State
If TMS is held low on the rising edge of TCK, the controller enters the Pause-DR state. The instruction does not change while the TAP controller is in this state. All test data registers selected by the current instruction retain their previous value during this state.
Page 107: Pause-Dr State
The instruction does not change in this state.
9.5.11 Capture-IR State
When the controller is in the Capture-IR state, the shift register contained in the instruction register loads the fixed value 0001 on the rising edge of TCK.
Page 108: Shift-Ir State
The instruction shifted into the instruction register is latched onto the parallel output from the shift-register path on the falling edge of TCK. Once latched, the new instruction becomes the current instruction. Test data registers selected by the current instruction retain their previous values.
Page 109 If TMS is held high on the rising edge of TCK, the controller enters the Select-DR-Scan state. If TMS is held low on the rising edge of TCK, the controller enters the Run-Test/Idle state.
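The state descriptions in this chapter follow the standard IEEE 1149.1 TAP controller, which can be modeled on a host as a small lookup table. The sketch below is an illustration built from the standard 16-state diagram, not from PXA255-specific information; the enum, table, and function names are hypothetical.

```c
/* The 16 TAP controller states defined by IEEE 1149.1. */
enum tap_state {
    TEST_LOGIC_RESET, RUN_TEST_IDLE,
    SELECT_DR_SCAN, CAPTURE_DR, SHIFT_DR, EXIT1_DR,
    PAUSE_DR, EXIT2_DR, UPDATE_DR,
    SELECT_IR_SCAN, CAPTURE_IR, SHIFT_IR, EXIT1_IR,
    PAUSE_IR, EXIT2_IR, UPDATE_IR
};

/* tap_next[state][tms]: state entered on the rising edge of TCK.
 * Column 0 is TMS low, column 1 is TMS high. */
static const enum tap_state tap_next[16][2] = {
    [TEST_LOGIC_RESET] = { RUN_TEST_IDLE,  TEST_LOGIC_RESET },
    [RUN_TEST_IDLE]    = { RUN_TEST_IDLE,  SELECT_DR_SCAN   },
    [SELECT_DR_SCAN]   = { CAPTURE_DR,     SELECT_IR_SCAN   },
    [CAPTURE_DR]       = { SHIFT_DR,       EXIT1_DR         },
    [SHIFT_DR]         = { SHIFT_DR,       EXIT1_DR         },
    [EXIT1_DR]         = { PAUSE_DR,       UPDATE_DR        },
    [PAUSE_DR]         = { PAUSE_DR,       EXIT2_DR         },
    [EXIT2_DR]         = { SHIFT_DR,       UPDATE_DR        },
    [UPDATE_DR]        = { RUN_TEST_IDLE,  SELECT_DR_SCAN   },
    [SELECT_IR_SCAN]   = { CAPTURE_IR,     TEST_LOGIC_RESET },
    [CAPTURE_IR]       = { SHIFT_IR,       EXIT1_IR         },
    [SHIFT_IR]         = { SHIFT_IR,       EXIT1_IR         },
    [EXIT1_IR]         = { PAUSE_IR,       UPDATE_IR        },
    [PAUSE_IR]         = { PAUSE_IR,       EXIT2_IR         },
    [EXIT2_IR]         = { SHIFT_IR,       UPDATE_IR        },
    [UPDATE_IR]        = { RUN_TEST_IDLE,  SELECT_DR_SCAN   }
};

/* Advance one TCK rising edge with the given TMS level (0 or non-zero). */
enum tap_state tap_step(enum tap_state s, int tms)
{
    return tap_next[s][tms ? 1 : 0];
}

/* Apply a sequence of TMS levels given as a string of '0'/'1' characters. */
enum tap_state tap_run(enum tap_state s, const char *tms_bits)
{
    for (; *tms_bits != '\0'; ++tms_bits)
        s = tap_step(s, *tms_bits == '1');
    return s;
}
```

For instance, `tap_run(TEST_LOGIC_RESET, "0100")` walks through Run-Test/Idle, Select-DR-Scan, and Capture-DR to Shift-DR, and holding TMS high for five TCK cycles returns the model to Test-Logic-Reset from any state.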
Page 111: Software Debug
The debugger can then restart execution of the application. The external debug interface to the PXA255 processor is via the JTAG port. Further details on the JTAG interface can be found in Section 9, “Test”.
Page 112: Monitor Mode
JTAG interface. This is to allow an external debugger to have access to the internal state of the processor. For the details of which bits can be accessed see Table 10-8, Table 10-12, and Table 10-3.
Page 113: Debug Control And Status Register (Dcsr)
Trap Software Interrupt (TS): Software Read Only, JTAG Read / Write, unchanged
Trap Undefined Instruction (TU): Software Read Only, JTAG Read / Write, unchanged
Trap Reset (TR): Software Read Only, JTAG Read / Write, unchanged
Page 114: Global Enable Bit (Ge)
A debug exception is generated before the instruction in the exception vector executes. Software running on the Intel® XScale™ core must set the Global Enable bit and the debugger must set the Halt Mode bit and the appropriate vector trap bit through JTAG to set up a non-reset vector trap.
Page 115: Sticky Abort Bit (Sa)
Buffer”.
10.4 Debug Exceptions
A debug exception causes the processor to re-direct execution to a debug event handling routine. The Intel® XScale™ core debug architecture defines the following debug exceptions:
1. instruction breakpoint
2. data breakpoint
3. software breakpoint
4. external debug break
5.
Page 116: Halt Mode
Section 10.13, “Downloading Code into the Instruction Cache” on page 10-30 for details about downloading code into the instruction cache. During Halt mode, software running on the Intel® XScale™ core cannot access DCSR, or any of the hardware breakpoint registers, unless the processor is in Special Debug State (SDS), described below.
Page 117: Monitor Mode
The following debug exceptions cause data aborts:
• data breakpoint
• external debug break
• trace-buffer full break
When the vector table is relocated (CP15 Control Register[13] = 1), the debug vector is relocated to 0xFFFF_0000.
Page 118: Hw Breakpoint Resources
10.5 HW Breakpoint Resources
The Intel® XScale™ core debug architecture defines two instruction and two data breakpoint registers, denoted IBCR0, IBCR1, DBR0, and DBR1. The instruction and data address breakpoint registers are 32-bit registers. The instruction breakpoint causes a break before execution of the target instruction.
Page 119: Instruction Breakpoints
Single step execution is accomplished using the instruction breakpoint registers and must be completely handled in software (either on the host or by the debug handler).
10.5.2 Data Breakpoints
The Intel® XScale™ core debug architecture defines two data breakpoint registers (DBR0, DBR1). The format of the registers is shown in Table 10-6.
Page 120: Data Breakpoint Controls Register (Dbcon)
On unaligned memory accesses, breakpoint address comparison is done on a word-aligned address (aligned down to word boundary).
Page 121: Software Breakpoints
All of the bits in the TXRXCTRL register are placed such that they can be read directly into the CC flags in the CPSR with an MRC (with Rd = PC). The subsequent instruction can then conditionally execute based on the updated CC value.
Page 122: Rx Register Ready Bit (Rr)
Before the high-speed download can start, both the debugger and debug handler must be synchronized, such that the debug handler is executing a routine that supports the high-speed download.
Page 123: Overflow Flag (Ov)
Table 10-10. High-Speed Download Handshaking States Debugger Actions Debugger wants to transfer code into the Intel® XScale™ core system memory. Prior to starting download, the debugger must poll the RR bit until it is clear. Once the RR bit is clear, indicating the debug handler is ready, the debugger starts the download.
Page 124: Tx Register Ready Bit (Tr)
RX and the data is ready for the debug handler to read.
loop: mrc p14, 0, r15, c14, c0, 0 # read the handshaking bit in TXRXCTRL
Page 125: Transmit Register (Tx)
JTAG), handshaking is required to prevent the debugger from writing new data to the register before the debug handler reads the previous data out. The handshaking is described in Section 10, “RX Register Ready Bit (RR)”.
Page 126: Debug Jtag Access
Debug Control and Status Register (DCSR). The debugger can only modify certain bits through JTAG, but can read the entire register. The SELDCSR instruction also allows the debugger to generate an external debug break.
Page 127: Seldcsr Jtag Register
Status Register (DCSR)” on page 10-3 are updated. An external host and the debug handler running on the Intel® XScale™ core must synchronize access to the DCSR. If one side writes the DCSR at the same time the other side reads the DCSR, the results are unpredictable.
Page 128: Dbg.hld_Rst
Debug mode will not be entered until all processor activity has ceased in an orderly fashion.
10.10.2.3 DBG.DCSR
The DCSR is updated with the value loaded into DBG.DCSR following an Update_DR. Only bits specified as writable by JTAG in Table 10-3 are updated.
Page 129: Dbgtx Jtag Command
A ‘1’ captured in DBG_SR[0] indicates the captured TX data is valid. After doing a Capture_DR, the debugger must place the JTAG state machine in the Shift_DR state to guarantee that a debugger read clears TXRXCTRL[28].
Page 130: Dbgrx Jtag Command
A Capture_DR loads TXRXCTRL[31] into DBG_SR[0]. The other bits in DBG_SR are loaded as shown in Figure 10-3. The captured data is scanned out during the Shift_DR state. While polling TXRXCTRL[31], incorrectly setting DBG_SR[35] or DBG_SR[1] will cause unpredictable behavior following an Update_DR.
Page 131: Rx Write Logic
The bits in the DBGRX data register (Figure 10-5) are used by the debugger to send data to the processor. The data register also contains a bit to flush previously written data and a high-speed download flag.
Page 132: Dbg.rr
DBG.RX is written into the RX register based on the output of the RX Write Logic. Any data that needs to be sent from the debugger to the processor must be loaded into DBG.RX with DBG.V set to 1. DBG.RX is loaded from DBG_SR[34:3] when the JTAG enters the Update_DR state.
Page 133: Dbg.d
DBG.D is provided for use during high speed download. This bit is written directly to TXRXCTRL[29]. The debugger sets DBG.D when downloading a block of code or data to the Intel® XScale™ core system memory. The debug handler then uses TXRXCTRL[29] as a branch flag to determine the end of the loop.
Page 134: Checkpoint Registers
Read/Write target address for corresponding entry in trace buffer
The two checkpoint registers (CHKPT0, CHKPT1) on the Intel® XScale™ core provide the debugger with two reference addresses to use for re-constructing the trace history. When the trace buffer is enabled, reading and writing to either checkpoint register has unpredictable results.
Page 135: Trace Buffer Register (Tbreg)
10.11.2 Trace Buffer Usage
The Intel® XScale™ core trace buffer is 256 bytes in length. The first byte read from the buffer represents the oldest trace history information in the buffer. The last (256th) byte read represents the most recent entry in the buffer. The last byte read from the buffer will always be a message byte.
Page 136: High Level View Of Trace Buffer
Bytes”). If the first non-zero entry is any other type of message byte, then these 0’s indicate that the trace buffer has not wrapped around and that first non-zero entry is the start of the trace.
Page 137: Trace Buffer Entries
[Figure: message byte formats. Exception format fields: M = Message Type Bit, VVV = exception vector[4:2], CCCC = Incremental Word Count. Non-exception format fields: MMMM = Message Type Bits, CCCC = Incremental Word Count.]
Table 10-19 shows all of the possible trace messages.
Page 138: Exception Message Byte
Non-exception Message Byte
Non-exception message bytes are used for direct branches, indirect branches, and rollovers. In a non-exception message byte, the 4-bit message type field (MMMM) specifies the type of message (refer to Table 10-19).
Page 139: Address Bytes
MSB of the target address is read out first; the LSB is the fourth byte read out; and the indirect branch message byte is the fifth byte read out. The byte organization of the indirect branch message is shown in Figure 10-8.
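On the debugger side, the byte order described above can be turned back into a 32-bit target address as sketched below. The function name is hypothetical, and the sketch only reassembles the address; it does not interpret the message byte itself.

```c
#include <stdint.h>

/* Rebuild the 32-bit target address of an indirect branch entry.
 * entry holds five consecutive bytes in read-out order:
 * entry[0..3] are the address bytes, MSB first, and entry[4] is the
 * indirect branch message byte (not decoded here). */
uint32_t indirect_branch_target(const uint8_t entry[5])
{
    return ((uint32_t)entry[0] << 24) |
           ((uint32_t)entry[1] << 16) |
           ((uint32_t)entry[2] << 8)  |
            (uint32_t)entry[3];
}
```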
Page 140: Downloading Code Into The Instruction Cache
The Intel® XScale™ core supports loading either instruction cache during reset and during program execution. Loading the instruction cache during normal program execution requires a strict handshaking protocol between software running on the Intel® XScale™ core and the external host.
Page 141: Ldic Jtag Data Register
All LDIC functions and data consist of 33-bit packets which are scanned into LDIC_SR1 during the Shift_DR state. Update_DR parallel loads LDIC_SR1 into LDIC_REG which is then synchronized with the Intel® XScale™ core clock and loaded into the LDIC_SR2. Once data is loaded into LDIC_SR2, the LDIC State Machine turns on and serially shifts the contents of LDIC_SR2 to the instruction cache.
Page 142: Ldic Cache Functions
10.13.3 LDIC Cache Functions
The Intel® XScale™ core supports four cache functions that can be executed through JTAG. Two functions allow an external host to download code into the main instruction cache or the mini instruction cache through JTAG. Two additional functions are supported to allow lines to be invalidated in the instruction cache.
Page 143: Loading Ic During Reset
• LDIC mode: active when LDIC JTAG instruction is loaded in the JTAG IR; prevents the mini instruction cache and the main instruction cache from being invalidated during reset.
Page 144: Loading Ic During Cold Reset For Debug
NOTE: In Figure 10-11, hold_rst is a signal that gets set and cleared through JTAG. When the JTAG IR contains the SELDCSR instruction, the hold_rst signal is set to the value scanned into DBG_SR[1].
Page 145: Code Download During A Cold Reset For Debug
The Halt Mode bit must remain set to prevent the instruction cache from being invalidated.
9. When hold_rst is cleared, internal reset is de-asserted, and the processor executes the reset vector at address 0.
Page 146: Loading Ic During A Warm Reset For Debug
In this last scenario, the mini instruction cache does not get invalidated by reset, since the processor is in Halt Mode. This scenario is described in more detail in this section. The last scenario described above is shown in Figure 10-12.
Page 147: Code Download During A Warm Reset For Debug
4) Place the LDIC JTAG instruction in the JTAG IR, then proceed with the normal code download, using the Invalidate IC Line function before loading each line. This requires 10 packets to be downloaded per cache line instead of the 9 packets as described in Section 10.13.3.
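The per-line packet counts above give a quick estimate of how much data a host must scan in for a download. The sketch below is a rough calculation under stated assumptions: a 32-byte (8-word) cache line, 33-bit packets, code padded to whole lines, and no allowance for TAP state transitions or synchronization delays. The names are hypothetical.

```c
#include <stdint.h>

enum {
    LDIC_LINE_BYTES  = 32, /* one cache line, 8 words (assumed) */
    LDIC_PACKET_BITS = 33  /* width of one LDIC packet */
};

/* Number of LDIC packets needed to download code_bytes of code.
 * Each line costs 9 packets, or 10 when an Invalidate IC Line
 * function precedes the load of every line. */
uint32_t ldic_packets(uint32_t code_bytes, int invalidate_first)
{
    /* Round up to whole cache lines. */
    uint32_t lines = (code_bytes + LDIC_LINE_BYTES - 1u) / LDIC_LINE_BYTES;
    uint32_t per_line = invalidate_first ? 10u : 9u;
    return lines * per_line;
}

/* Total bits shifted through the data register for the download. */
uint64_t ldic_scan_bits(uint32_t code_bytes, int invalidate_first)
{
    return (uint64_t)ldic_packets(code_bytes, invalidate_first)
           * LDIC_PACKET_BITS;
}
```

Filling a 2 KB mini instruction cache under these assumptions takes 576 packets without invalidation and 640 packets with it.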
Page 148: Dynamically Loading Ic After Reset
The description in this section focuses on using a debug handler running on the Intel® XScale™ core to synchronize with the external host, but the details apply for any application that is running while code is dynamically downloaded.
Page 149: Dynamic Code Download Synchronization
In a very simple debug handler stub, the above parts may form the complete handler downloaded during reset (with some handler entry and exit code). When a debug exception occurs, routines can be downloaded as necessary. This allows the entire handler to be dynamic.
Page 150: Mini Instruction Cache Overview
JTAG LDIC function. Code downloaded into the mini instruction cache is essentially locked - it cannot be overwritten by application code running on the Intel® XScale™ core. It is not locked against code downloaded through the JTAG LDIC functions.
Page 151: Setting Up Override Vector Tables
One possibility is to set up vector traps on the non-reset exception vectors. These vector locations can then be used to extend the reset vector.
Page 152: Implementing A Debug Handler
(non RRX) and MCR/MRC instructions, as a temporary scratch register.
• The following instructions must not be executed in Debug Mode as they will result in unpredictable behavior:
  LDR w/ Rd=PC
  LDR w/ RRX addressing mode
Page 153: Dynamic Debug Handler
10.14.2.3 Dynamic Debug Handler
On the Intel® XScale™ core, the debug handler and override vector tables may reside in the 2 KB mini instruction cache, separate from the main instruction cache. A “static” Debug Handler is downloaded during reset. This is the base handler code, necessary to do common operations such as handler entry/exit, parse commands from the debugger, read/write ARM* registers, read/write memory, etc.
Page 154: High-Speed Download
Using this assumption, the debugger does not have to poll RR to see whether the handler has read the previous data - it assumes the previous data has been consumed and immediately starts scanning in the next data word.
Page 155: Ending A Debug Session
2. turn off all breakpoints;
3. invalidate the mini instruction cache;
4. invalidate the main instruction cache;
5. invalidate the BTB;
These actions ensure that the application program executes correctly after the debugger has been disconnected.
Page 156: Software Debug Notes
However, in this specific case, the overflow flag does not get set, so the debugger is unaware that the download was not successful.
Page 157: Performance Considerations
The timings in this section are specific to the PXA255 processor, and how it implements the ARM* v5TE architecture. This is not a summary of all possible optimizations nor is it an explanation of the ARM* v5TE instruction set.
Page 158: Instruction Latencies
The load and store addressing modes implemented in the Intel® XScale™ core do not add to the instruction latencies numbers. The following section explains how to read these tables.
Page 159: Branch Instruction Timings
(stalled) umlal umlal
11.2.2 Branch Instruction Timings
Table 11-3. Branch Instruction Timings (Those predicted by the BTB)
Columns: Mnemonic; Minimum Issue Latency when Correctly Predicted by the BTB; Minimum Issue Latency with Branch Misprediction
Page 160: Data Processing Instruction Timings
If the next instruction needs to use the result of the data processing for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed.
Page 161: Multiply Instruction Timings
Rs[31:15] = 0x00000 RdLo = 2; RdHi = 3 Rs[31:15] = 0x1FFFF Rs[31:27] = 0x00 RdLo = 3; RdHi = 4 SMULL Rs[31:27] = 0x1F RdLo = 4; RdHi = 5 all others SMULWy SMULxy Intel® XScale™ Microarchitecture User’s Manual 11-5...
Page 162: Saturated Arithmetic Instructions
If the next instruction needs to use the result of the MRA for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed.
11.2.5 Saturated Arithmetic Instructions
Page 163: Status Register Access Instructions
1 for writeback of base STRB 1 for writeback of base STRBT 1 for writeback of base STRD 1 for writeback of base STRH 1 for writeback of base STRT 1 for writeback of base Intel® XScale™ Microarchitecture User’s Manual 11-7...
Page 164: Semaphore Instructions
Minimum Result Latency Table 11-15. CP14 Register Access Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency 11.2.10 Miscellaneous Instruction Timing Table 11-16. SWI Instruction Timings Mnemonic Minimum latency to first instruction of SWI exception handler 11-8 Intel® XScale™ Microarchitecture User’s Manual...
Page 165: Thumb Instructions
Minimum Interrupt Latency is defined as the minimum number of cycles from the assertion of any interrupt signal (IRQ or FIQ) to the execution of the instruction at the vector for that interrupt. An active system responding to an interrupt will typically depend predominantly on the PXA255 processor’s internal & external bus activity.
Page 167: Optimization Guide
Intel® XScale™ core architecture. It is written for developers who are optimizing compilers or performance analysis tools for the Intel® XScale™ core based processors. It can also be used by application developers to obtain the best performance from their assembly language code. The optimizations presented in this chapter are based on the Intel®...
Page 168: General Pipeline Characteristics
A.2.1.1. Number of Pipeline Stages
The Intel® XScale™ core has a longer pipeline (7 stages versus 5 stages for StrongARM*) which operates at a much higher frequency than its predecessors do. This allows for greater overall performance. The longer Intel® XScale™ core pipeline has several negative consequences, however: ...
Page 169: Out Of Order Completion
A.2.1.5. Use of Bypassing
The Intel® XScale™ core pipeline makes extensive use of bypassing to minimize data hazards. Bypassing allows results forwarding from multiple sources, eliminating the need to stall the pipeline.
Page 170: Instruction Flow Through The Pipeline
The progress of an instruction can stall anywhere in the pipeline. Several pipestages may stall for various reasons. It is important to understand when and how hazards occur in the Intel® XScale™ core pipeline. Performance degradation could be significant if care is not taken to minimize pipeline stalls.
Page 171: Id (Instruction Decode) Pipestage
ALU calculation - the ALU performs arithmetic and logic operations, as required for data processing instructions and load/store index calculations.
• Determine conditional instruction execution - The instruction’s condition is compared to the CPSR prior to execution of each instruction. Any instruction with a false condition is ...
Page 172: X2 (Execute 2) Pipestage
Multiply/Multiply Accumulate (MAC) Pipeline
The Multiply-Accumulate (MAC) unit executes all multiply and multiply-accumulate instructions supported by the Intel® XScale™ core. The MAC implements the 40-bit Intel® XScale™ core accumulator register acc0 and handles the instructions that transfer its value to and from general-purpose ARM* registers.
Page 173: Behavioral Description
A.3.1 Conditional Instructions
The Intel® XScale™ core architecture provides the ability to execute instructions conditionally. This feature combined with the ability of the Intel® XScale™ core instructions to modify the condition codes makes possible a wide array of optimizations.
A.3.1.1. ...
Page 174: Optimizing Branches
The code generated above takes three cycles to execute the else part and four cycles for the if-part assuming best case conditions and no branch misprediction penalties. In the case of the Intel® XScale™ core, a branch misprediction incurs a penalty of four cycles. If the branch is mispredicted 50% of the time, and if we assume that both the if-part and the else-part are equally likely to be taken, on average the code above takes 5.5 cycles to execute.
Page 175 If we were to use the Intel® XScale™ core to execute instructions conditionally, the code generated for the above if-else statement is:
    cmp r0, #10
    movgt r0, #0
    movle r0, #1
The above code segment would not incur any branch misprediction penalties and would take three cycles to execute assuming best case conditions.
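The 5.5-cycle average quoted for the branching version follows from weighting the two paths and adding the expected misprediction cost. The helper below is a small sketch of that arithmetic; the function name and parameters are illustrative, and the model ignores everything except the path lengths and the misprediction penalty.

```c
/* Expected cycles for an if-else implemented with a branch.
 * if_cycles, else_cycles: best-case cycles of each path.
 * p_if: probability that the if-part is taken (0.0 to 1.0).
 * p_mispredict: probability that the branch is mispredicted.
 * penalty: cycles lost on a misprediction. */
double branch_expected_cycles(double if_cycles, double else_cycles,
                              double p_if, double p_mispredict,
                              double penalty)
{
    double base = p_if * if_cycles + (1.0 - p_if) * else_cycles;
    return base + p_mispredict * penalty;
}
```

With the figures from the text, `branch_expected_cycles(4, 3, 0.5, 0.5, 4)` gives 0.5 * 4 + 0.5 * 3 + 0.5 * 4 = 5.5 cycles, compared with a fixed three cycles for the conditionally executed sequence.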
Page 176: Optimizing Complex Expressions
The use of conditional instructions in the above fashion improves performance by minimizing the number of branches, thereby minimizing the penalties caused by branch mispredictions. This approach also reduces the utilization of branch prediction resources.
Page 177: Bit Field Manipulation
A.3.2 Bit Field Manipulation
The Intel® XScale™ core shift and logical operations provide a useful way of manipulating bit fields. Bit field operations can be optimized as follows:
;Set the bit number specified by r1 in register r0...
Page 178: Effective Use Of Addressing Modes
A.3.5 Effective Use of Addressing Modes
The Intel® XScale™ core provides a variety of addressing modes that make indexing an array of objects highly efficient. For a detailed description of these addressing modes please refer to the ARM* Architecture Reference Manual.
Page 179: Instruction Cache
A.4.1.1. Cache Miss Cost
The Intel® XScale™ core performance is highly dependent on reducing the cache miss rate. Note that this cache miss penalty becomes significant when the core is running much faster than external memory. Executing non-cached instructions severely curtails the processor's performance in this case and it is very important to do everything possible to minimize cache misses.
Page 180: Data And Mini Cache
A.4.2 Data and Mini Cache
The Intel® XScale™ core allows the user to define memory regions whose cache policies can be set by the user (see Section 6.2.3, “Cache Policies”). Supported policies and configurations are: ...
Page 181: Read Allocate And Read-Write Allocate Memory Regions
Application performance can be improved by converting a part of the cache into on-chip RAM and allocating frequently used variables to it. Due to the Intel® XScale™ core round-robin replacement policy, all data will eventually be evicted. Therefore to prevent critical or frequently used data from being evicted it should be allocated to on-chip RAM.
Page 182: Data Alignment
In this case if tdata[] is not aligned to a cache line, then the prefetch using the address of tdata[i+1].ia may not include element id. If the array was aligned on a cache line + 12 bytes, then the prefetch would have to be placed on &tdata[i+1].id.
Page 183: Literal Pools
A.4.2.7. Literal Pools
The Intel® XScale™ core does not have a single instruction that can move all literals (a constant or address) to a register. One technique to load registers with literals in the Intel® XScale™ core is by loading the literal from a memory location that has been initialized with the constant or address.
Page 184: Memory Page Thrashing
Scheduling the prefetch instruction requires some understanding of the system latency times and system resources which affect when to use the prefetch instruction. For the PXA255 processor a cache line fill of 8 words from external memory will take more than 10 memory clocks, depending on external RAM speed and system timing configuration.
Page 185: Compute Vs. Data Bus Bound
A.4.4.5. Bandwidth Limitations
Overuse of prefetches can usurp resources and degrade performance. This happens because once the bus traffic requests exceed the system resource capacity, the processor stalls. The Intel® XScale™ core data transfer resources are:
4 fill buffers
4 pending buffers
...
Page 186: Cache Memory Considerations
Similarly rearranging sections of data structures so that sections often written fit in the same half cache line [16 bytes for the Intel® XScale™ core] can reduce cache eviction write- backs. On a global scale, techniques such as array merging can enhance the spatial locality of the data.
Page 187: Cache Blocking
This problem can be resolved by prefetch unrolling. For example consider:
for(i=0; i<NMAX; i++)
{
    prefetch(data[i+2]);
    sum += data[i];
}
Page 188: Pointer Prefetch
Note the order reversal of the prefetches in relationship to the usage. If there is a cache conflict and data is evicted from the cache then only the data from the first prefetch is lost.
Page 189: Loop Interchange
However, the load ties up the receiving register until the data can be used. For example: r2, [r0] ; Process code { not yet cached latency > 30 core clocks } r1, r1, r2 Intel® XScale™ Microarchitecture User’s Manual A-23...
Page 190: Instruction Scheduling
Scheduling Loads
On the Intel® XScale™ core, an LDR instruction has a result latency of 3 cycles assuming the data being loaded is in the data cache. If the instruction after the LDR needs to use the result of the load, then it would stall for 2 cycles.
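The relationship between result latency and stall cycles can be expressed as a one-line rule, sketched below. It is a simplified model with a hypothetical function name: it assumes each independent instruction placed between the load and its first use issues in one cycle, and it ignores cache misses and other pipeline hazards.

```c
/* Stall cycles seen by the first instruction that uses a result.
 * result_latency: result latency of the producing instruction, in cycles.
 * independent_between: number of independent single-cycle instructions
 * scheduled between the producer and the consumer. */
int load_use_stall_cycles(int result_latency, int independent_between)
{
    int stall = result_latency - 1 - independent_between;
    return stall > 0 ? stall : 0;
}
```

For an LDR with a result latency of 3, the model gives 2 stall cycles when the very next instruction uses the loaded value, and none once two independent instructions have been scheduled in between.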
Page 191 LSL #2 r9, r9, #0xf r8, r6, r8 r6, [sp], #4 r8, r8, #4 r8, r8, #0xf r1, r6, r7 r3, r6, r2 ; The value in register r6 is not used after this Intel® XScale™ Microarchitecture User’s Manual A-25...
Page 192: Scheduling Load And Store Double (Ldrd/Strd
; The value in register r6 is not used after this The Intel® XScale™ core has 4 fill-buffers that are used to fetch data from external memory when a data-cache miss occurs. The Intel® XScale™ core stalls when all fill buffers are in use. This happens when more than 4 loads are outstanding and are being fetched from memory.
Page 193: Scheduling Load And Store Multiple (Ldm/Stm
Similarly, the code sequence shown below takes 5 cycles to complete. r0, {r2, r3} r1, r1, #1 The alternative version which is shown below would only take 3 cycles to complete. strd r2, [r0] r1, r1, #1 Intel® XScale™ Microarchitecture User’s Manual A-27...
Page 194: Scheduling Data Processing Instructions
A.5.2 Scheduling Data Processing Instructions
Most Intel® XScale™ core data processing instructions have a result latency of 1 cycle. This means that the current instruction is able to use the result from the previous data processing instruction. However, the result latency is 2 cycles if the current instruction needs to use the result of the previous data processing instruction for a shift by immediate.
Page 195: Scheduling Swp And Swpb Instructions
Similarly, the code shown below would incur a 2 cycle penalty due to the 3-cycle result latency for the second destination register. r6, r7, acc0 r1, r7 r0, r6 r2, r2, #1 Intel® XScale™ Microarchitecture User’s Manual A-29...
Page 196: Scheduling The Mia And Miaph Instructions
The MRS instruction has an issue latency of 1 cycle and a result latency of 2 cycles. The MSR instruction has an issue latency of 2 cycles (6 if updating the mode bits) and a result latency of 1 cycle.
Page 197: Scheduling Coprocessor Instructions
Optimizing for smaller code size will, in general, lower the performance of your application. These are some techniques for optimizing for code size using the Intel® XScale™ core instruction set. Many optimizations mentioned in the previous chapters improve the performance of ARM* code.
Page 198: Use Of Pld Instructions
32-bit ARM* code. However, in some unusual cases where Instruction Cache size is a significant influence, being able to hold more Thumb instructions in cache may aid performance. Whatever the performance outcome, Thumb coding significantly reduces code size.
