Page 1 Intel XScale® Core Developer’s Manual January, 2004 Order Number: 273473-002...
Page 2 TokenExpress, Trillium, Vivonic, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. The ARM* and ARM Powered logo marks (the ARM marks) are trademarks of ARM, Ltd., and Intel uses these marks under license from ARM, Ltd. *Other names and brands may be claimed as the property of others.
Page 3: Table Of Contents
Introduction, p. 13
About This Document, p. 13
1.1.1 How to Read This Document, p. 13
1.1.2 Other Relevant Documents, p. 14
High-Level Overview of the Intel XScale® Core, p. 15
1.2.1 ARM Compatibility, p. 15
1.2.2 Features, p. 16
1.2.2.1 Multiply/Accumulate (MAC), p. 16
1.2.2.2...
Page 4: Contents
3.2.2.2 Cacheable (C), Bufferable (B), and eXtension (X) Bits, p. 38
3.2.2.3 Instruction Cache, p. 38
3.2.2.4 Data Cache and Write Buffer, p. 39
3.2.2.5 Details on Data Cache and Write Buffer Behavior, p. 40
3.2.2.6 Memory Operation Ordering...
Page 5: Contents
6.2.4 Round-Robin Replacement Algorithm, p. 68
6.2.5 Parity Protection, p. 68
6.2.6 Atomic Accesses, p. 68
Data Cache and Mini-Data Cache Control, p. 69
6.3.1 Data Memory State After Reset, p. 69
6.3.2 Enabling/Disabling, p. 69
6.3.3...
Page 6: Contents
8.3.5 Overflow Flag Status Register (FLAG), p. 110
8.3.6 Event Select Register (EVTSEL), p. 111
8.3.7 Managing the Performance Monitor, p. 112
Performance Monitoring Events, p. 113
8.4.1 Instruction Cache Efficiency Mode, p. 115
8.4.2...
Page 7: Contents
9.11.1.3 DCSR (DBG_SR[34:3]), p. 140
9.11.2 DBGTX JTAG Register, p. 141
9.11.2.1 DBG_SR[0], p. 141
9.11.2.2 TX (DBG_SR[34:3]), p. 141
9.11.3 DBGRX JTAG Register, p. 142
9.11.3.1 RX Write Logic, p. 143
9.11.3.2 DBG_SR[0], p. 143
9.11.3.3 flush_rr, p. 143
9.11.3.4 hs_download, p. 143...
Page 8: Contents
A.2.1 General Pipeline Characteristics, p. 176
A.2.1.1. Number of Pipeline Stages, p. 176
A.2.1.2. The Intel XScale® Core Pipeline Organization, p. 177
A.2.1.3. Out Of Order Completion, p. 178
A.2.1.4. Register Scoreboarding, p. 178
A.2.1.5.
Page 9: Contents
A.4.4.6. Bandwidth Limitations, p. 200
A.4.4.7. Cache Memory Considerations, p. 201
A.4.4.8. Cache Blocking, p. 203
A.4.4.9. Prefetch Unrolling, p. 203
A.4.4.10. Pointer Prefetch, p. 204
A.4.4.11. Loop Interchange, p. 205
A.4.4.12. Loop Fusion, p. 205
A.4.4.13.
Page 10: Contents
Figures
1-1 Architecture Features, p. 16
3-1 Example of Locked Entries in TLB, p. 45
4-1 Instruction Cache Organization, p. 47
4-2 Locked Line Effect on Round Robin Replacement, p. 54
5-1 BTB Entry, p. 57
5-2 Branch History...
Page 11: Contents
Tables
2-1 Multiply with Internal Accumulate Format, p. 24
2-2 MIA{<cond>} acc0, Rm, Rs, p. 25
2-3 MIAPH{<cond>} acc0, Rm, Rs, p. 25
2-4 MIAxy{<cond>} acc0, Rm, Rs, p. 26
2-5 Internal Accumulator Access Format, p. 27
2-6 MAR{<cond>} acc0, RdLo, RdHi, p. 28
2-7 MRA{<cond>} RdLo, RdHi, acc0...
Page 12: Contents
8-6 Clock Count Register (CCNT), p. 106
8-7 Performance Monitor Count Register (PMN0 - PMN3), p. 107
8-8 Performance Monitor Control Register, p. 108
8-9 Interrupt Enable Register, p. 109
8-10 Overflow Flag Status Register, p. 110
8-11 Event Select Register...
Page 13: Introduction
Intel retains the right to make changes to these specifications at any time, without notice. In particular, descriptions of features, timings, and pin-outs do not imply a commitment to implement them.
Page 14: Other Relevant Documents
This document describes Version 5TE of the ARM Architecture, which includes the Thumb ISA and ARM DSP-Enhanced ISA. (ISBN 0 201 737191)
• StrongARM SA-1100 Microprocessor Developer’s Manual, Intel Order #278105
• StrongARM SA-110 Microprocessor Technical Reference Manual, Intel Order #278104
Page 15: High-Level Overview Of The Intel XScale® Core
1.2.1 ARM Compatibility
ARM Version 5 (V5) Architecture added floating point instructions to ARM Version 4. The Intel XScale® core implements the integer instruction set architecture of ARM V5, but does not provide hardware support of the floating point instructions.
Page 16: Features
1.2.2 Features
Figure 1-1 shows the major functional blocks of the Intel XScale® core. The following sections give a brief, high-level overview of these blocks.
[Figure 1-1. Architecture Features]
Page 17: Memory Management
1.2.2.2 Memory Management
The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. The MMU provides access protection and virtual-to-physical address translation. The MMU Architecture also specifies the caching policies for the instruction cache and data memory.
Page 18: Performance Monitoring
1.2.2.6 Performance Monitoring
Performance monitoring counters have been added to the Intel XScale® core. They can be configured to monitor various events in the core. These events allow a software developer to measure cache efficiency, detect system bottlenecks and reduce the overall latency of programs.
Page 19: Terminology And Conventions
Once an entry is flushed in the cache it can no longer be used by the program.
XSC1
XSC1 refers to a variant of the Intel XScale® core denoted by a CoreGen (Coprocessor 15, ID Register) value of 0x1. This variant has a 2-counter performance monitor and a 5-bit JTAG instruction register. See Table 7-4, “ID Register”...
Page 21: Programming Model
Programming Model
This chapter describes the programming model of the Intel XScale® core, namely the implementation options and extensions to the ARM Version 5TE architecture.
ARM Architecture Compatibility
The Intel XScale® core implements the integer instruction set architecture specified in ARM V5TE.
Page 22: ARM DSP-Enhanced Instruction Set
Section 2.3.1.2 for more information. Access to coprocessors 15 and 14 generates an undefined instruction exception. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for the behavior when accessing all other coprocessors. 2.2.5...
Page 23: Extensions To ARM Architecture
CP0. If this is the case, a complete definition can be found in the Intel XScale® core implementation option section of the ASSP architecture specification. For this very reason, software should not rely on behavior that is specific to the 40-bit length of the accumulator, since the length may be extended.
Page 24: Multiply With Internal Accumulate Format
Rm - Multiplicand
Two new fields were created for this format, acc and opcode_3. The acc field specifies 1 of 8 internal accumulators to operate on and opcode_3 defines the operation for this format. The Intel XScale® core defines a single 40-bit accumulator referred to as acc0; future implementations may...
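The effect of a fixed-width accumulator can be illustrated with a small software model. The sketch below is not taken from this manual: it assumes that MIA multiplies Rm and Rs as signed 32-bit values and keeps the low bits of the running sum, and the names `to_signed`, `mia` and `acc_bits` are hypothetical. The accumulator width is a parameter rather than a hard-coded 40 because the text warns that the length may be extended.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mia(acc0, rm, rs, acc_bits=40):
    """Assumed model of MIA: acc0 = acc0 + Rm * Rs, truncated to acc_bits.

    rm and rs are 32-bit register images; acc0 is the accumulator image.
    """
    product = to_signed(rm, 32) * to_signed(rs, 32)
    return (acc0 + product) & ((1 << acc_bits) - 1)


acc = mia(0, 2, 3)                # 2 * 3 accumulated into an empty acc0
acc = mia(acc, 0xFFFFFFFF, 6)     # Rm holds -1, so the sum returns to zero
```

Code that inspects bit 39 directly for overflow would be relying on the 40-bit length, which is the kind of dependence the text advises against.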
Page 25: MIA{<cond>} acc0, Rm, Rs
Table 2-2. MIA{<cond>} acc0, Rm, Rs
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
cond 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
Operation: if ConditionPassed(<cond>) then...
Page 26: MIAxy{<cond>} acc0, Rm, Rs
Table 2-4. MIAxy{<cond>} acc0, Rm, Rs
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
cond 1 1 1 0 0 0 1 0 1 1 x y 0 0 0 0 0 0 0 1
Operation: if ConditionPassed(<cond>) then...
Page 27: Internal Accumulator Access Format
2.3.1.2 Internal Accumulator Access Format
The Intel XScale® core defines a new instruction format for accessing internal accumulators in CP0. Table 2-5, “Internal Accumulator Access Format” on page 2-27 shows that the opcode falls into the coprocessor register transfer space.
Page 28: MAR{<cond>} acc0, RdLo, RdHi
Table 2-6. MAR{<cond>} acc0, RdLo, RdHi
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
cond 1 1 0 0 0 1 0 0...
Page 29: New
P bit in the first level descriptors to allow an ASSP to identify a new memory attribute. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to find out how the P bit has been defined. Bit 1 in the Control Register (coprocessor 15, register 1, opcode=1) is used to assign the P bit memory attribute for memory accesses made during page table walks.
Page 30: Second-Level Descriptors For Coarse
Table 2-8. First-level Descriptors
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
Coarse page table base address Domain...
Page 31: Additions To CP15 Functionality
2.3.3 Additions to CP15 Functionality
To accommodate the functionality in the Intel XScale® core, registers in CP15 and CP14 have been added or augmented. See Chapter 7, “Configuration” for details. At times it is necessary to be able to guarantee exactly when a CP15 update takes effect. For example, when enabling memory address translation (turning on the MMU), it is vital to know when the MMU is actually guaranteed to be in operation.
Page 32: Event Architecture
2.3.4 Event Architecture
2.3.4.1 Exception Summary
Table 2-11 shows all the exceptions that the core may generate, and the attributes of each. Subsequent sections give details on each exception.
Table 2-11. Exception Summary...
Page 33: Prefetch Aborts
2.3.4.3 Prefetch Aborts
The Intel XScale® core detects three types of prefetch aborts: Instruction MMU abort, external abort on an instruction access, and an instruction cache parity error. These aborts are described in Table 2-13.
Page 34: Data Aborts
2.3.4.4 Data Aborts
Two types of data aborts exist in the Intel XScale® core: precise and imprecise. A precise data abort is defined as one where R14_ABORT always contains the PC (+8) of the instruction that caused the exception. An imprecise abort is one where R14_ABORT contains the PC (+4) of the next instruction to execute and not the address of the instruction that caused the abort.
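The two R14_ABORT conventions translate into simple address arithmetic for an abort handler. The sketch below only restates the offsets given above; the function names are hypothetical and 32-bit ARM-state addresses are assumed.

```python
ADDRESS_MASK = 0xFFFFFFFF  # assumed 32-bit address space


def precise_abort_instruction(r14_abort):
    """Address of the instruction that caused a precise data abort.

    R14_ABORT holds that instruction's address plus 8.
    """
    return (r14_abort - 8) & ADDRESS_MASK


def imprecise_abort_resume_address(r14_abort):
    """Address of the next instruction to execute after an imprecise abort.

    R14_ABORT holds that address plus 4; the instruction that caused the
    abort cannot be recovered from R14_ABORT in this case.
    """
    return (r14_abort - 4) & ADDRESS_MASK


faulting = precise_abort_instruction(0x00008010)      # 0x00008008
resume = imprecise_abort_resume_address(0x00008010)   # 0x0000800C
```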
Page 35: Events From Preload Instructions
Although the core guarantees the Base Restored Abort Model for precise aborts, it cannot do so in the case of imprecise aborts. A Data Abort handler may encounter an updated base register if it is invoked because of an imprecise abort.
Page 36: Debug Events
This feature allows software to issue PLDs speculatively. For example, Example 2-3 on page 2-36 places a PLD instruction early in the loop. This PLD is used to fetch data for the next loop iteration.
Page 37: Memory Management
Memory Management
This chapter describes the memory management unit implemented in the Intel XScale® core.
Overview
The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. To accelerate virtual-to-physical address translation, the core uses both an instruction Translation Look-aside Buffer (TLB) and a data TLB to cache the latest translations.
Page 38: Architecture Model
The P bit allows an ASSP to assign its own page attribute to a memory region. This bit is only present in the first level descriptors. Refer to the Intel XScale® core implementation section of the ASSP architecture specification to find out how this has been defined. Accesses to memory for page table walks do not use the MMU.
Page 39: Data Cache And Write Buffer
3.2.2.4 Data Cache and Write Buffer
All of these descriptor bits affect the behavior of the Data Cache and the Write Buffer. If the X bit for a descriptor is zero, the C and B bits operate as mandated by the ARM architecture.
Page 40: Details On Data Cache And Write Buffer Behavior
3.2.2.5 Details on Data Cache and Write Buffer Behavior
If the MMU is disabled, all data accesses will be non-cacheable and non-bufferable. This is the same behavior as when the MMU is enabled and a data access uses a descriptor with X, C, and B all set to 0.
Page 41: Interaction Of The MMU, Instruction Cache, And Data Cache
Interaction of the MMU, Instruction Cache, and Data Cache
The MMU, instruction cache, and data/mini-data cache may be enabled/disabled independently. The instruction cache can be enabled with the MMU enabled or disabled. However, the data cache can only be enabled when the MMU is enabled.
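A check along the following lines could be applied to a value before it is written to the Control Register. This is a sketch under assumptions: the instruction cache enable at bit 12 is stated elsewhere in this manual, while the MMU enable at bit 0 and the data cache enable at bit 2 follow the usual ARM Control Register layout and are not stated in this excerpt. The function name is hypothetical.

```python
MMU_ENABLE = 1 << 0      # M bit (assumed from the ARM Control Register layout)
DCACHE_ENABLE = 1 << 2   # C bit (assumed from the ARM Control Register layout)
ICACHE_ENABLE = 1 << 12  # I bit (bit 12, as given for the instruction cache)


def is_valid_enable_combination(control_value):
    """Return True if the enable bits respect the rule stated above.

    The data cache may only be enabled together with the MMU; the
    instruction cache may be enabled with or without the MMU.
    """
    mmu_on = bool(control_value & MMU_ENABLE)
    dcache_on = bool(control_value & DCACHE_ENABLE)
    return mmu_on or not dcache_on


icache_only_ok = is_valid_enable_combination(ICACHE_ENABLE)
dcache_without_mmu_ok = is_valid_enable_combination(DCACHE_ENABLE)
```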
Page 42: Control
Control
3.4.1 Invalidate (Flush) Operation
The entire instruction and data TLB can be invalidated at the same time with one command or they can be invalidated separately. An individual entry in the data or instruction TLB can also be invalidated.
Page 43: Locking Entries
3.4.3 Locking Entries
Individual entries can be locked into the instruction and data TLBs. See Table 7-14, “Cache Lockdown Functions” on page 7-90 for the exact commands. If a lock operation finds the virtual address translation already resident in the TLB, the results are unpredictable.
Page 44
The proper procedure for locking entries into the data TLB is shown in Example 3-3 on page 3-44.
Example 3-3. Locking Entries into the Data TLB
; R1, and R2 contain the virtual addresses to translate and lock into the data TLB
MCR P15,0,R1,C8,C6,1 ;...
Page 45: Round-Robin Replacement Algorithm
3.4.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the TLBs is round-robin; there is a round-robin pointer that keeps track of the next entry to replace. The next entry to replace is the one sequentially after the last entry that was written.
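The pointer behavior can be sketched as a small model. The class and attribute names are hypothetical, the number of entries is left as a parameter, and the handling of locked entries (assumed here to occupy the lowest entry numbers, so that the pointer wraps to the first unlocked entry) is an assumption made for illustration rather than a statement of the hardware behavior.

```python
class RoundRobinPointer:
    """Software model of a round-robin replacement pointer."""

    def __init__(self, num_entries, num_locked=0):
        if not 0 <= num_locked < num_entries:
            raise ValueError("at least one entry must remain unlocked")
        self.num_entries = num_entries
        self.num_locked = num_locked      # assumed to be entries 0..num_locked-1
        self.next_entry = num_locked      # first entry eligible for replacement

    def allocate(self):
        """Return the entry to replace and advance to the one after it."""
        victim = self.next_entry
        self.next_entry += 1
        if self.next_entry == self.num_entries:
            self.next_entry = self.num_locked   # wrap past the locked entries
        return victim


pointer = RoundRobinPointer(num_entries=4, num_locked=1)
victims = [pointer.allocate() for _ in range(5)]   # entry 0 is never chosen
```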
Page 47: Instruction Cache
Instruction Cache
The Intel XScale® core instruction cache enhances performance by reducing the number of instruction fetches from external memory. The cache provides fast execution of cached code. Code can also be locked down when guaranteed or fast access time is required.
Page 48: Operation
Operation
4.2.1 Operation When Instruction Cache is Enabled
When the cache is enabled, it compares every instruction request address against the addresses of instructions that it is currently holding. If the cache contains the requested instruction, the access “hits”...
Page 49: Fetch Policy
4.2.3 Fetch Policy
An instruction-cache “miss” occurs when the requested instruction is not found in the instruction fetch buffers or instruction cache; a fetch request is then made to external memory. The instruction cache can handle up to two “misses.”...
Page 50: Parity Protection
4.2.5 Parity Protection
The instruction cache is protected by parity to ensure data integrity. Each instruction cache word has 1 parity bit. (The instruction cache tag is NOT parity protected.) When a parity error is detected on an instruction cache access, a prefetch abort exception occurs if the core attempts to execute the instruction.
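One parity bit per 32-bit word detects any odd number of flipped bits in that word. The sketch below illustrates the idea only: the manual does not say whether the hardware uses even or odd parity, so even parity is assumed, and the function names are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF  # one instruction cache word is 32 bits


def parity_bit(word):
    """Even-parity bit (an assumption): 1 if the word has an odd number of 1s."""
    return bin(word & WORD_MASK).count("1") & 1


def has_parity_error(word, stored_parity):
    """Compare the parity recomputed on a read with the stored parity bit."""
    return parity_bit(word) != stored_parity


stored = parity_bit(0xE1A00000)                         # parity saved at fill time
single_flip = has_parity_error(0xE1A00001, stored)      # one flipped bit is caught
double_flip = has_parity_error(0xE1A00003, stored)      # two flipped bits are not
```

The last line shows the known limit of a single parity bit: an even number of bit errors in the same word goes undetected.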
Page 51: Instruction Fetch Latency
4.2.6 Instruction Fetch Latency
The instruction fetch latency is dependent on the core to memory frequency ratio, system bus bandwidth, system memory, etc., which are all particular to each ASSP. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for exact details on instruction fetch latency.
Page 52: Instruction Cache Control
Instruction Cache Control
4.3.1 Instruction Cache State at RESET
After reset, the instruction cache is always disabled, unlocked, and invalidated (flushed).
4.3.2 Enabling/Disabling
The instruction cache is enabled by setting bit 12 in coprocessor 15, register 1 (Control Register).
Page 53: Invalidating The Instruction Cache
4.3.3 Invalidating the Instruction Cache
The entire instruction cache, along with the fetch buffers, is invalidated by writing to coprocessor 15, register 7. (See Table 7-12, “Cache Functions” on page 7-87 for the exact command.) This command does not unlock any lines that were locked in the instruction cache nor...
Page 54: Locking Instructions In The Instruction Cache
4.3.4 Locking Instructions in the Instruction Cache
Software has the ability to lock performance critical routines into the instruction cache. Up to 28 lines in each set can be locked; hardware will ignore the lock command if software is trying to lock all the lines in a particular set (i.e., ways 28-31 can never be locked).
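The per-set limit can be modeled as a counter that silently refuses lock requests once 28 lines are locked. The class below is a hypothetical illustration of that rule only; it assumes 32 ways per set (implied by ways 28-31 above) and does not model tags, fills, or the actual lock command.

```python
WAYS_PER_SET = 32          # implied by "ways 28-31" in the text
MAX_LOCKED_PER_SET = 28    # ways 28-31 can never be locked


class InstructionCacheSet:
    """Tracks how many lines of one set are locked."""

    def __init__(self):
        self.locked_lines = 0

    def lock_line(self):
        """Try to lock one more line; return False if the request is ignored."""
        if self.locked_lines >= MAX_LOCKED_PER_SET:
            return False               # hardware ignores the lock command
        self.locked_lines += 1
        return True

    def replaceable_ways(self):
        """Number of ways still available to round-robin replacement."""
        return WAYS_PER_SET - self.locked_lines


cache_set = InstructionCacheSet()
accepted = sum(cache_set.lock_line() for _ in range(30))   # only 28 succeed
```

Keeping at least four ways unlocked means that fetches mapping to a heavily locked set can still be cached.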
Page 55: Unlocking Instructions In The Instruction Cache
Software can lock down several different routines located at different memory locations. This may cause some sets to have more locked lines than others as shown in Figure 4-2. Example 4-4 on page 4-55 shows how a routine, called “lockMe”...
Page 57: Branch Target Buffer
Branch Target Buffer
The Intel XScale® core uses dynamic branch prediction to reduce the penalties associated with changing the flow of program execution. The core features a branch target buffer that provides the instruction cache with the target address of branch type instructions.
Page 58: Reset
The history bits represent four possible prediction states for a branch entry in the BTB. Figure 5-2, “Branch History” on page 5-58 shows these states along with the possible transitions. The initial state for branches stored in the BTB is Weakly-Taken (WT).
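A four-state history of this kind is commonly implemented as a two-bit saturating counter. The sketch below assumes that conventional scheme (a taken branch moves one step toward Strongly-Taken, a not-taken branch one step toward Strongly-Not-Taken); the authoritative transitions are those in Figure 5-2, which is not reproduced here. Only the four-state structure and the Weakly-Taken initial state come from the text, and all names are hypothetical.

```python
from enum import IntEnum


class History(IntEnum):
    STRONGLY_NOT_TAKEN = 0
    WEAKLY_NOT_TAKEN = 1
    WEAKLY_TAKEN = 2
    STRONGLY_TAKEN = 3


INITIAL_STATE = History.WEAKLY_TAKEN   # initial state given in the text


def update(state, taken):
    """Assumed saturating-counter update after a branch resolves."""
    if taken:
        return History(min(state + 1, History.STRONGLY_TAKEN))
    return History(max(state - 1, History.STRONGLY_NOT_TAKEN))


def predict_taken(state):
    """Predict taken in either of the two taken states."""
    return state >= History.WEAKLY_TAKEN


state = INITIAL_STATE
for outcome in (True, False, False):   # taken, then not taken twice
    state = update(state, outcome)
```

With this scheme a single mispredicted iteration, such as a loop exit, does not flip the prediction of a branch that has reached a strong state.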
Page 59: BTB Control
BTB Control
5.2.1 Disabling/Enabling
The BTB is always disabled with Reset. Software can enable the BTB through a bit in a coprocessor register (see Section 7.2.2). Before enabling or disabling the BTB, software must invalidate it (described in the following section).
Page 61: Data Cache
Data Cache
The Intel XScale® core data cache enhances performance by reducing the number of data accesses to and from external memory. There are two data cache structures in the core, a data cache with two...
Page 62: Data Cache Organization
[Figure 6-1. Data Cache Organization. Example: 32 Kbyte cache, drawn as Set 0 through Set 31 selected by the Set Index, with each way holding a 32-byte cache line.]
Page 63: Mini-Data Cache Overview
6.1.2 Mini-Data Cache Overview
The mini-data cache is 1/16 the size of the data cache, so depending on the data cache size selected, the available sizes are 2 Kbytes or 1 Kbyte. The 2 Kbyte version has 32 sets and the 1 Kbyte version has 16 sets;...
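The sizes quoted above follow from simple cache geometry: capacity equals sets times ways times line size. The sketch below works through that arithmetic using the 32-byte line size and the set counts given in this chapter; the function name is hypothetical, and the 16 Kbyte data cache size is inferred from the 1 Kbyte mini-data cache being 1/16 of it.

```python
LINE_BYTES = 32  # cache line size used throughout this chapter


def ways_per_set(cache_bytes, num_sets, line_bytes=LINE_BYTES):
    """Associativity implied by a cache's capacity, set count and line size."""
    ways, remainder = divmod(cache_bytes, num_sets * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of ways")
    return ways


data_cache_bytes = 32 * 1024
mini_cache_bytes = data_cache_bytes // 16              # 2048 bytes
data_ways = ways_per_set(data_cache_bytes, 32)         # 32 ways (way 0 to way 31)
mini_ways_2k = ways_per_set(mini_cache_bytes, 32)      # 2 ways
mini_ways_1k = ways_per_set((16 * 1024) // 16, 16)     # 2 ways
```

Both mini-data cache configurations work out to two ways per set; only the number of sets changes with the size.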
Page 64: Write Buffer And Fill Buffer Overview
6.1.3 Write Buffer and Fill Buffer Overview
The Intel XScale® core employs an eight-entry write buffer, each entry containing 16 bytes. Stores to external memory are first placed in the write buffer and subsequently taken out when the bus is available.
Page 65: Data Cache And Mini-Data Cache Operation
Data Cache and Mini-Data Cache Operation
The following discussions refer to the data cache and mini-data cache as one cache (data/mini-data) since their behavior is the same when accessed.
6.2.1 Operation When Caching is Enabled
When the data/mini-data cache is enabled for an access, the data/mini-data cache compares the address of the request against the addresses of data that it is currently holding.
Page 66: Read Miss Policy
6.2.3.2 Read Miss Policy
The following sequence of events occurs when a cacheable (see Section 6.2.3.1, “Cacheability” on page 6-65) load operation misses the cache:
1. The fill buffer is checked to see if an outstanding fill request already exists for that line.
Page 67: Write Miss Policy
6.2.3.3 Write Miss Policy
A write operation that misses the cache will request a 32-byte cache line from external memory if the access is cacheable and write allocation is specified in the page. In this case the following sequence of events occurs:
1.
Page 68: Round-Robin Replacement Algorithm
6.2.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the data cache is round-robin. Each set in the data cache has a round-robin pointer that keeps track of the next line (in that set) to replace. The next line to replace in a set is the next sequential line after the last one that was just filled.
Page 69: Data Cache And Mini-Data Cache Control
Data Cache and Mini-Data Cache Control
6.3.1 Data Memory State After Reset
After processor reset, both the data cache and mini-data cache are disabled, all valid bits are set to zero (invalid), and the round-robin bit points to way 31. Any lines in the data cache that were configured as data RAM before reset are changed back to cacheable lines after reset, i.e., there are...
Page 70: Global Clean And Invalidate Operation
6.3.3.1 Global Clean and Invalidate Operation
A simple software routine is used to globally clean the data cache. It takes advantage of the line-allocate data cache operation, which allocates a line into the data cache. This allocation evicts any dirty cache data back to external memory.
Page 71: Re-Configuring The Data Cache As Data RAM
Re-configuring the Data Cache as Data RAM
Software has the ability to lock tags associated with 32-byte lines in the data cache, thus creating the appearance of data RAM. Any subsequent access to this line will always hit the cache unless it is invalidated.
Page 72
Example 6-3. Locking Data into the Data Cache
; R1 contains the virtual address of a region of memory to lock,
; configured with C=1 and B=1
; R0 is the number of 32-byte lines to lock into the data cache. In this
;...
Page 73
Example 6-4. Creating Data RAM
; R1 contains the virtual address of a region of memory to configure as data RAM,
; which is aligned on a 32-byte boundary.
; MMU is configured so that the memory region is cacheable.
Page 74: Locked Line Effect On Round Robin Replacement
Tags can be locked into the data cache by enabling the data cache lock mode bit located in coprocessor 15, register 9. (See Table 7-14, “Cache Lockdown Functions” on page 7-90 for the exact command.) Once enabled, any new lines allocated into the data cache will be locked down.
Page 75: Write Buffer/Fill Buffer Operation And Control
Note that an ASSP may also include operations external to the core in the drain operation. (Refer to the Intel XScale® core implementation option section in the ASSP architecture specification for more details.) See Table 7-12, “Cache Functions”...
Page 77: Configuration
Any access to CP14 in user mode will cause an undefined instruction exception. Coprocessors CP15 and CP14 on the Intel XScale® core do not support access via CDP, MRRC, or MCRR instructions. An attempt to access these coprocessors with these instructions will result in an undefined instruction exception.
Page 78: MRC/MCR Format
Bits 11:8, cp_num - coprocessor number: 0b1111 = CP15, 0b1110 = CP14, 0b0000 = CP0. NOTE: Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to see if there are any other coprocessors defined by the ASSP.
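The position of cp_num within an MRC/MCR instruction word can be illustrated by assembling the word field by field. The sketch below uses the standard ARM coprocessor register transfer layout; apart from cp_num in bits 11:8, the field positions are not given in this excerpt and are stated here from the ARM architecture. The function names are hypothetical.

```python
def encode_register_transfer(cp_num, opcode_1, rd, crn, crm, opcode_2,
                             read=False, cond=0xE):
    """Assemble an MCR (read=False) or MRC (read=True) instruction word.

    Layout: cond[31:28] 1110[27:24] opcode_1[23:21] L[20] CRn[19:16]
    Rd[15:12] cp_num[11:8] opcode_2[7:5] 1[4] CRm[3:0].
    """
    for name, value, limit in (("cp_num", cp_num, 16), ("opcode_1", opcode_1, 8),
                               ("rd", rd, 16), ("crn", crn, 16),
                               ("crm", crm, 16), ("opcode_2", opcode_2, 8),
                               ("cond", cond, 16)):
        if not 0 <= value < limit:
            raise ValueError(f"{name} out of range: {value}")
    return ((cond << 28) | (0b1110 << 24) | (opcode_1 << 21) | (int(read) << 20)
            | (crn << 16) | (rd << 12) | (cp_num << 8) | (opcode_2 << 5)
            | (1 << 4) | crm)


def cp_num_of(instruction):
    """Extract the coprocessor number from bits 11:8."""
    return (instruction >> 8) & 0xF


# MCR P15,0,R1,C8,C6,1 (the data TLB operation used in Example 3-3)
word = encode_register_transfer(cp_num=15, opcode_1=0, rd=1, crn=8, crm=6, opcode_2=1)
```

For this instruction the assembled word is 0xEE081F36, and `cp_num_of(word)` returns 15, selecting CP15.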
Page 79: LDC/STC Format When Accessing CP14
Bits 11:8, cp_num - coprocessor number. The Intel XScale® core defines the following: 0b1111 = Undefined Exception, 0b1110 = CP14. NOTE: Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to find out the meaning of the other encodings.
Page 80: CP15 Registers
CP15 Registers
Table 7-3 lists the CP15 registers implemented in the Intel XScale® core.
Table 7-3. CP15 Registers
Register (CRn), Opc_1, Opc_2, Access, Description
Read / Write-Ignored Read / Write-Ignored Cache Type Read / Write...
Page 81: Register 0: ID & Cache Type Registers
The ID Register is selected when opcode_2=0. This register returns the code for the ASSP, where a portion of it is defined by the ASSP. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for the exact encoding.
Page 82: Cache Type Register
Table 7-5. Cache Type Register
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
0 0 0 0 1 0 1 1 0 0 0...
Page 83: Register 1: Control & Auxiliary Control Registers
7.2.2 Register 1: Control & Auxiliary Control Registers
Register 1 is made up of two registers: one that is compliant with ARM Version 5TE and is referred to by opcode_2 = 0x0, and another that is specific to the core and is referred to by opcode_2 = 0x1. The latter is known as the Auxiliary Control Register.
Page 84: Auxiliary Control Register
Reserved: Read-Unpredictable / Write-as-Zero
Page Table Memory Attribute (P): Read / Write. This field is defined by the ASSP. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for more information.
Write Buffer Coalescing Disable (K)
Page 85: Register 2: Translation Table Base Register
7.2.3 Register 2: Translation Table Base Register
Table 7-8. Translation Table Base Register
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9...
Page 86: Register 5: Fault Status Register
7.2.6 Register 5: Fault Status Register
The Fault Status Register (FSR) indicates which fault has occurred, which could be either a prefetch abort or a data abort. Bit 10 extends the encoding of the status field for prefetch aborts and data aborts.
Page 87: Register 7: Cache Functions
7.2.8 Register 7: Cache Functions
This register should be accessed as write-only. Reads from this register, as with an MRC, have an undefined effect. The Drain Write Buffer function not only drains the write buffer but also drains the fill buffer. The core does not check permissions on addresses supplied for cache or TLB functions.
Page 88
Other items to note about the line-allocate command are:
• It forces all pending memory operations to complete.
• Bits [31:5] of Rd are used to specify the virtual address of the line to allocate into the data cache.
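Because only bits [31:5] of Rd select the line, any address inside a 32-byte line names the same line. The sketch below shows the masking involved; the function names are hypothetical and 32-bit virtual addresses are assumed.

```python
LINE_BYTES = 32
LINE_MASK = 0xFFFFFFFF & ~(LINE_BYTES - 1)   # keeps bits [31:5], clears [4:0]


def line_address(virtual_address):
    """Virtual address of the 32-byte line containing virtual_address."""
    return virtual_address & LINE_MASK


def same_line(address_a, address_b):
    """True if both addresses select the same line for line-allocate."""
    return line_address(address_a) == line_address(address_b)


base = line_address(0x1234567F)              # 0x12345660
shared = same_line(0x12345660, 0x1234567F)   # first and last byte of one line
```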
Page 89: Register 8: TLB Operations
7.2.9 Register 8: TLB Operations
Disabling/enabling the MMU has no effect on the contents of either TLB: valid entries stay valid, locked items remain locked. All operations defined in Table 7-13 work regardless of whether the TLB is enabled or disabled.
Page 90: Register 9: Cache Lock Down
7.2.10 Register 9: Cache Lock Down
Register 9 is used for locking down entries into the instruction cache and data cache. (The protocol for locking down entries can be found in Chapter 6, “Data Cache”.)
Page 91: Register 10: TLB Lock Down
7.2.11 Register 10: TLB Lock Down
Register 10 is used for locking down entries into the instruction TLB and data TLB. (The protocol for locking down entries can be found in Chapter 3, “Memory Management”.) Lock/unlock...
Page 92: The PID Register Affect On Addresses
7.2.13.1 The PID Register Affect On Addresses
All addresses generated and used by User Mode code are eligible for being “PIDified” as described in the previous section. Privileged code, however, must be aware of certain special cases in which address generation does not follow the usual flow.
Page 93: Register 14: Breakpoint Registers
(DBR0), one configurable data mask/address register (DBR1), and one data breakpoint control register (DBCON). Refer to Chapter 9, “Software Debug” for more information on these features of the Intel XScale® core.
Table 7-19. Accessing the Debug Registers
Function...
Page 94: Register 15: Coprocessor Access Register
This register controls access to CP0 and other coprocessors (CP1 through CP13) that may exist in an ASSP. (See the Intel XScale® core implementation option section of the ASSP architecture specification for a list of coprocessors that may have been implemented.) A typical use for this register is for an operating system to control resource sharing among applications.
Page 95: Coprocessor Access Register
Read-as-Zero/Write-as-Zero compatibility
Bits 13:1 (Read / Write): Coprocessor Access Rights - Each bit in this field corresponds to the access rights for each coprocessor. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to find out which, if any, coprocessors exist and for the definition of these bits.
Page 96: Cp14 Registers
Intel XScale® Core Developer’s Manual Configuration CP14 Registers CP14 contains software debug registers, clock and power management registers and the performance monitor registers. All other registers are reserved in CP14. Reading and writing them yields unpredictable results. 7.3.1 Performance Monitoring Registers There are two variants of the performance monitoring facility;...
Page 97: Xsc2 Performance Monitoring Registers
Intel XScale® Core Developer’s Manual Configuration 7.3.1.2 XSC2 Performance Monitoring Registers The performance monitoring unit in XSC2 contains a control register (PMNC), a clock counter (CCNT), interrupt enable register (INTEN), overflow flag register (FLAG), event selection register (EVTSEL) and four event counters (PMN0 through PMN3). The format of these registers can be found in Chapter 8, “Performance...
Page 98: Clock And Power Management Registers
= 0x0). This function informs the clocking unit (located external to the core) to change the core clock frequency. Software can read CCLKCFG to determine the current operating frequency. The exact definition of this register can be found in the Intel XScale® core implementation option section of the ASSP architecture specification.
Page 99: Software Debug Registers
Intel XScale® Core Developer’s Manual Configuration 7.3.3 Software Debug Registers Software debug is supported by address breakpoint registers (Coprocessor 15, register 14), serial communication over the JTAG interface and a trace buffer. Registers 8, 9 and 14 are used for the serial interface, register 10 is for general control and registers 11 through 13 support a 256 entry trace buffer.
Page 101: Performance Monitoring
If any of the counters overflow, an interrupt request will occur if it’s enabled. (What happens to the interrupt request is definable by the ASSP, which typically contains an interrupt controller that handles priority, masking, steering to FIQ or IRQ, etc. Refer to the Intel ®...
Page 102: Xsc1 Register Description (2 Counter Variant)
Intel XScale® Core Developer’s Manual Performance Monitoring XSC1 Register Description (2 counter variant) Table 8-1 contains details on accessing these registers with MRC and MCR coprocessor instructions. Table 8-1. XSC1 Performance Monitoring Registers Description Instruction Register# Register# (PMNC) Performance Monitor Control...
Page 103: Performance Count Registers (Pmn0 - Pmn1; Cp14 - Register 2 And 3, Respectively)
Intel XScale® Core Developer’s Manual Performance Monitoring 8.2.2 Performance Count Registers (PMN0 - PMN1; CP14 - Register 2 and 3, Respectively) There are two 32-bit event counters; their format is shown in Table 8-7. The event counters are reset to ‘0’ by the PMNC register or can be set to a predetermined value by directly writing to them.
Page 104: Performance Monitor Control Register (Cp14, Register 0)
Intel XScale® Core Developer’s Manual Performance Monitoring Table 8-4. Performance Monitor Control Register (CP14, register 0) 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 evtCount1...
Page 105: Managing Pmnc
Intel XScale® Core Developer’s Manual Performance Monitoring 8.2.4.1 Managing PMNC The following are a few notes about controlling the performance monitoring mechanism: • An interrupt will be reported when a counter’s overflow flag is set and its associated interrupt enable bit is set in the PMNC register. The interrupt will remain asserted until software clears the overflow flag by writing a one to the flag that is set.
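The write-one-to-clear behavior described above can be modeled in a few lines. This is a software model only, not driver code: the class name and bit numbering are hypothetical, and it assumes that writing a zero to a flag leaves it unchanged. The real flag and enable bit positions are those given in the PMNC register table.

```python
class OverflowFlags:
    """Toy model of overflow flags that are cleared by writing a one."""

    def __init__(self) -> None:
        self.flags = 0    # one bit per counter, set by hardware on overflow
        self.enables = 0  # one interrupt-enable bit per counter

    def hw_overflow(self, bit: int) -> None:
        """Hardware side: a counter overflowed, so its flag becomes set."""
        self.flags |= 1 << bit

    def sw_write(self, value: int) -> None:
        """Software side: a 1 clears that flag, a 0 leaves it unchanged (assumed)."""
        self.flags &= ~value

    def irq_asserted(self) -> bool:
        """The request stays asserted while any enabled flag is still set."""
        return (self.flags & self.enables) != 0
```

In this model, writing a one to a different flag does not deassert the interrupt; only a one written to the flag that is actually set does.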
Page 106: Xsc2 Register Description (4 Counter Variant)
Intel XScale® Core Developer’s Manual Performance Monitoring XSC2 Register Description (4 counter variant) Table 8-5 contains details on accessing these registers with MRC and MCR coprocessor instructions. Table 8-5. Performance Monitoring Registers Description Instruction Register# Register# (PMNC) Performance Monitor Control...
Page 107: Performance Count Registers (Pmn0 - Pmn3)
Intel XScale® Core Developer’s Manual Performance Monitoring 8.3.2 Performance Count Registers (PMN0 - PMN3) There are four 32-bit event counters; their format is shown in Table 8-7. The event counters are reset to ‘0’ by setting bit 1 in the PMNC register or can be set to a predetermined value by directly writing to them.
Page 108: Performance Monitor Control Register (Pmnc)
Intel XScale® Core Developer’s Manual Performance Monitoring 8.3.3 Performance Monitor Control Register (PMNC) The performance monitor control register (PMNC) is a coprocessor register that: • contains the PMU ID • extends CCNT counting by six more bits (cycles between counter rollover = 2 •...
Page 109: Interrupt Enable Register (Inten)
Intel XScale® Core Developer’s Manual Performance Monitoring 8.3.4 Interrupt Enable Register (INTEN) Each counter can generate an interrupt request when it overflows. INTEN enables interrupt requesting for each counter. Table 8-9. Interrupt Enable Register 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9...
Page 110: Overflow Flag Status Register (Flag)
Intel XScale® Core Developer’s Manual Performance Monitoring 8.3.5 Overflow Flag Status Register (FLAG) FLAG identifies which counter has overflowed and also indicates an interrupt has been requested if the overflowing counter’s corresponding interrupt enable bit (contained within INTEN) is asserted.
Page 111: Event Select Register (Evtsel)
Intel XScale® Core Developer’s Manual Performance Monitoring 8.3.6 Event Select Register (EVTSEL) EVTSEL is used to select events for PMN0, PMN1, PMN2 and PMN3. Refer to Table 8-12, “Performance Monitoring Events” on page 8-113 for a list of possible events.
Page 112: Managing The Performance Monitor
Intel XScale® Core Developer’s Manual Performance Monitoring 8.3.7 Managing the Performance Monitor The following are a few notes about controlling the performance monitoring mechanism: • An interrupt request will be generated when a counter’s overflow flag is set and its associated interrupt enable bit is set in INTEN.
Page 113: Performance Monitoring Events
PC changes to the event address, e.g., IRQ, FIQ, SWI, etc. 0x10 through 0x17: Defined by ASSP. See the Intel XScale® core implementation option section of the ASSP architecture specification for more details.
Page 114: Some Common Uses Of The Pmu
Intel XScale® Core Developer’s Manual Performance Monitoring Some typical combinations of counted events are listed in this section and summarized in Table 8-13. In this section, we call such an event combination a mode. Table 8-13. Some Common Uses of the PMU...
Page 115: Instruction Cache Efficiency Mode
Intel XScale® Core Developer’s Manual Performance Monitoring 8.4.1 Instruction Cache Efficiency Mode PMN0 totals the number of instructions that were executed, which does not include instructions fetched from the instruction cache that were never executed. This can happen if a branch instruction changes the program flow;...
Page 116: Data/Bus Request Buffer Full Mode
This is calculated by dividing PMN0 by PMN1. This statistic lets you know if the duration event cycles are due to many requests or are attributed to just a few requests. If the average is high then the Intel XScale® core may be starved of the external bus. •...
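The division described above is simple enough to show directly. The sketch below assumes PMN0 holds the duration count in cycles and PMN1 the number of occurrences, which is what dividing PMN0 by PMN1 implies. The function name is hypothetical, and the zero-request guard is an addition for robustness rather than something the manual specifies.

```python
def average_request_duration(pmn0_cycles: int, pmn1_requests: int) -> float:
    """Average cycles per request: duration count divided by occurrence count."""
    if pmn1_requests == 0:
        return 0.0  # no requests were counted, so there is nothing to average
    return pmn0_cycles / pmn1_requests
```

With 1200 duration cycles spread over 4 requests, the average is 300 cycles per request, which would suggest a few long stalls rather than many short ones.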
Page 117: Instruction Tlb Efficiency Mode
Intel XScale® Core Developer’s Manual Performance Monitoring 8.4.6 Instruction TLB Efficiency Mode PMN0 totals the number of instructions that were executed, which does not include instructions that were translated by the instruction TLB and never executed. This can happen if a branch instruction changes the program flow;...
Page 118: Multiple Performance Monitoring Run Statistics
Multiple Performance Monitoring Run Statistics There may be times when the number of events to be monitored exceeds the number of counters. In this case, multiple performance monitoring runs can be done, capturing different events from each run.
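One way to organize such runs is to split the list of wanted events into groups no larger than the number of event counters (two on XSC1, four on XSC2). The sketch below is a hypothetical host-side helper, not part of any Intel tool, and the event numbers passed to it are placeholders.

```python
from typing import List, Sequence


def plan_runs(events: Sequence[int], num_counters: int) -> List[List[int]]:
    """Split event numbers into groups of at most num_counters, one group per run."""
    if num_counters < 1:
        raise ValueError("need at least one event counter")
    return [list(events[i:i + num_counters])
            for i in range(0, len(events), num_counters)]
```

Five events on a two-counter variant need three runs. The same workload has to be repeated for each run, so the resulting statistics are only comparable if the workload behaves the same way every time.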
Page 119: Examples
Intel XScale® Core Developer’s Manual Performance Monitoring Examples The same example is shown below for both variants (XSC1 and XSC2). 8.6.1 XSC1 Example (2 counter variant) In this example, the events selected with the Instruction Cache Efficiency mode are monitored and CCNT is used to measure total execution time.
Page 120: Xsc2 Example (4 Counter Variant)
Intel XScale® Core Developer’s Manual Performance Monitoring 8.6.2 XSC2 Example (4 counter variant) In this example, the events selected with the Instruction Cache Efficiency mode are monitored and CCNT is used to measure total execution time. Sampling time ends when PMN0 overflows which will generate an IRQ interrupt.
Page 121: Software Debug
Intel XScale® Core Developer’s Manual Software Debug Software Debug This chapter describes the software debug and related features implemented in Elkhart, namely: • debug modes, registers and exceptions. • a serial debug communication link via the JTAG interface. • a trace buffer.
Page 122: Introduction
Intel XScale® Core Developer’s Manual Software Debug Introduction The Elkhart debug unit, when used with a debugger application, allows software running on an Elkhart target to be debugged. The debug unit allows the debugger to stop program execution and re-direct execution to a debug handling routine. Once program execution has stopped, the debugger can examine or modify processor state, co-processor state, or memory.
Page 123: Debug Control And Status Register (Dcsr)
Intel XScale® Core Developer’s Manual Software Debug Debug Control and Status Register (DCSR) The DCSR register is the main control register for the debug unit. Table 9-1 shows the format of the register. The DCSR register can be accessed in privileged modes by software running on the core or by a debugger through the JTAG interface.
Page 124: Global Enable Bit (Ge)
SOC Break (B) Reading the SOC Break bit returns the value of the SOC break input into the Intel XScale® core. Use of the SOC break input to the core (used to generate SOC debug breaks) is product specific and is targeted towards chips that need system-on-a-chip debug capabilities.
Page 125: Vector Trap Bits (Tf,Ti,Td,Ta,Ts,Tu,Tr)
Intel XScale® Core Developer’s Manual Software Debug 9.4.4 Vector Trap Bits (TF,TI,TD,TA,TS,TU,TR) The Vector Trap bits allow instruction breakpoints to be set on exception vectors without using up any of the breakpoint registers. When a bit is set, it acts as if an instruction breakpoint was set up on the corresponding exception vector.
Page 126: Debug Exceptions
Intel XScale® Core Developer’s Manual Software Debug Debug Exceptions A debug exception causes the processor to re-direct execution to a debug event handling routine. The Elkhart debug architecture defines the following debug exceptions: • instruction breakpoint • data breakpoint •...
Page 127: Halt Mode
Intel XScale® Core Developer’s Manual Software Debug 9.5.1 Halt Mode The debugger turns on Halt Mode through the JTAG interface by scanning in a value that sets the bit in DCSR. The debugger turns off Halt Mode through JTAG, either by scanning in a new DCSR value or by a TRST.
Page 128 Intel XScale® Core Developer’s Manual Software Debug Following a debug exception, the processor switches to debug mode and enters SDS, which allows the following special functionality: • All events are disabled. SWI or undefined instructions have unpredictable results. The processor ignores pre-fetch aborts, FIQ and IRQ (SDS disables FIQ and IRQ regardless of the enable values in the CPSR).
Page 129: Monitor Mode
Intel XScale® Core Developer’s Manual Software Debug 9.5.2 Monitor Mode In Monitor Mode, the processor handles debug exceptions like normal ARM exceptions, except for SOC debug breaks, which are handled like Halt Mode exceptions. If debug functionality is enabled and the processor is in Monitor Mode, debug exceptions cause either a data abort or a pre-fetch abort.
Page 130: Hw Breakpoint Resources
Intel XScale® Core Developer’s Manual Software Debug HW Breakpoint Resources The Elkhart debug architecture defines two instruction and two data breakpoint registers, denoted IBCR0, IBCR1, DBR0, and DBR1. The instruction and data address breakpoint registers are 32-bit registers. The instruction breakpoint causes a break before execution of the target instruction.
Page 131: Data Breakpoints
Intel XScale® Core Developer’s Manual Software Debug 9.6.2 Data Breakpoints The Elkhart debug architecture defines two data breakpoint registers (DBR0, DBR1). The format of the registers is shown in Table 9-6. Table 9-6. Data Breakpoint Register (DBRx) 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9...
Page 132 Intel XScale® Core Developer’s Manual Software Debug When DBR1 is programmed as a data address mask, it is used in conjunction with the address in DBR0. The bits set in DBR1 are ignored by the processor when comparing the address of a memory access with the address in DBR0.
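The masked comparison described above can be sketched as follows. This models only the address match; it leaves out the access-type selection made in DBCON, and the function name is hypothetical.

```python
def data_breakpoint_hit(access_addr: int, dbr0: int, dbr1_mask: int) -> bool:
    """Compare addresses while ignoring every bit position that is set in the mask."""
    care_bits = ~dbr1_mask & 0xFFFFFFFF
    return (access_addr & care_bits) == (dbr0 & care_bits)
```

With DBR0 = 0x10000000 and a mask of 0xFFF, any access within the 4 KB region starting at 0x10000000 matches, while an access to 0x10001000 does not.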
Page 133: Software Breakpoints
Intel XScale® Core Developer’s Manual Software Debug Software Breakpoints Mnemonics: BKPT (See ARM Architecture Reference Manual, ARMv5T) Operation: If DCSR[31] = 0, BKPT is a nop; If DCSR[31] =1, BKPT causes a debug exception The processor handles the software breakpoint as described in Section 9.5, “Debug Exceptions”...
Page 134: Transmit/Receive Control Register (Txrxctrl)
Transmit/Receive Control Register (TXRXCTRL) Communications between the debug handler and debugger are controlled through handshaking bits that ensure the debugger and debug handler make synchronized accesses to TX and RX. The debugger side of the handshaking is accessed through the DBGTX (Section 9.11.2, “DBGTX JTAG...
Page 135: Rx Register Ready Bit (Rr)
Intel XScale® Core Developer’s Manual Software Debug 9.8.1 RX Register Ready Bit (RR) The debugger and debug handler use the RR bit to synchronize accesses to RX. Normally, the debugger and debug handler use a handshaking scheme that requires both sides to poll the RR bit.
Page 136: Overflow Flag (Ov)
Intel XScale® Core Developer’s Manual Software Debug 9.8.2 Overflow Flag (OV) The Overflow flag is a sticky flag that is set when the debugger writes to the RX register while the RR bit is set. The flag is used during high-speed download to indicate that some data was lost. The assumption during high-speed download is that the time it takes for the debugger to shift in the next data word is greater than the time necessary for the debug handler to process the previous data word.
Page 137: Tx Register Ready Bit (Tr)
Intel XScale® Core Developer’s Manual Software Debug 9.8.4 TX Register Ready Bit (TR) The debugger and debug handler use the TR bit to synchronize accesses to the TX register. The debugger and debug handler must poll the TR bit before accessing the TX register.
Page 138: Transmit Register (Tx)
Intel XScale® Core Developer’s Manual Software Debug Transmit Register (TX) The TX register is the debug handler transmit buffer. The debug handler sends data to the debugger through this register. Table 9-13. TX Register 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9...
Page 139: Debug Jtag Access
Intel XScale® Core Developer’s Manual Software Debug 9.11 Debug JTAG Access There are four JTAG instructions used by the debugger during software debug: LDIC, SELDCSR, DBGTX and DBGRX. LDIC is described in Section 9.14, “Downloading Code in the Instruction Cache”. The other three JTAG instructions are described in this section. SELDCSR, DBGTX and DBGRX each use a 36-bit shift register to scan in new data and scan out captured data.
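A debugger host typically has to place a 32-bit word into, and pull one out of, the 36-bit value it shifts through JTAG. The sketch below assumes only the field positions named by this chapter's subsection titles, DBG_SR[34:3] for the data word and DBG_SR[0] for a status or control bit. Bits 35, 2 and 1 carry register-specific meanings that are not modeled here, and the helper names are hypothetical.

```python
from typing import Tuple

DATA_SHIFT = 3          # data word occupies DBG_SR[34:3]
DATA_MASK = 0xFFFFFFFF  # 32-bit data field


def pack_dbg_sr(data: int, bit0: int = 0) -> int:
    """Place a 32-bit word in DBG_SR[34:3] and a flag in DBG_SR[0]."""
    return ((data & DATA_MASK) << DATA_SHIFT) | (bit0 & 1)


def unpack_dbg_sr(dbg_sr: int) -> Tuple[int, int]:
    """Return (data word from DBG_SR[34:3], value of DBG_SR[0])."""
    return (dbg_sr >> DATA_SHIFT) & DATA_MASK, dbg_sr & 1
```

Packing and then unpacking a value returns the original word and flag, and the packed value always fits in 36 bits.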
Page 140: Hold_Reset
Intel XScale® Core Developer’s Manual Software Debug 9.11.1.1 hold_reset The debugger uses hold_reset when loading code into the instruction cache during a processor reset. Details about loading code into the instruction cache are in Section 9.14, “Downloading Code in the Instruction Cache”.
Page 141: Dbgtx Jtag Register
Intel XScale® Core Developer’s Manual Software Debug 9.11.2 DBGTX JTAG Register The ‘DBGTX’ JTAG instruction selects the DBGTX JTAG data register. The JTAG opcode for this instruction is ‘0b0010000’. The debug handler uses the DBGTX data register to send data to the debugger.
Page 142: Dbgrx Jtag Register
9.11.3 DBGRX JTAG Register The ‘DBGRX’ JTAG instruction selects the DBGRX JTAG data register. The JTAG opcode for this instruction is ‘0b0000010’. The debug handler uses the DBGRX data register to receive information from the debugger. A protocol can be set up between the debugger and debug handler to allow the handler to identify data values and commands.
Page 143: Rx Write Logic
Intel XScale® Core Developer’s Manual Software Debug 9.11.3.1 RX Write Logic The RX write logic (Figure 9-3) serves the following functions: 1) RX Write Enable: The RX register only gets updated when rx_valid is set and is unaffected if rx_valid is clear or an overflow occurs. In particular, when the debugger is polling DBG_SR[0], as long as rx_valid is 0, Update_DR does not modify RX.
Page 144: Rx_Valid
Intel XScale® Core Developer’s Manual Software Debug 9.11.3.6 rx_valid The debugger sets the rx_valid bit to indicate the data scanned into DBG_SR[34:3] is valid data to be written to RX. When this bit is set, the data scanned into the DBG_SR will be written to RX following an Update_DR.
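The interaction between rx_valid, the RR bit and the sticky overflow flag can be summarized with a small behavioral model. This is a sketch under stated assumptions, not a description of the actual logic in Figure 9-3: it assumes that a successful write sets RR, that a handler read of RX clears RR, and that a write attempted while RR is set is dropped and only sets OV. The class and method names are hypothetical.

```python
class RxModel:
    """Toy model of the RX register, the RR bit and the sticky OV flag."""

    def __init__(self) -> None:
        self.rx = 0
        self.rr = False  # RX Register Ready
        self.ov = False  # sticky Overflow flag

    def update_dr(self, data: int, rx_valid: bool) -> None:
        """Debugger side: an Update_DR with the scanned-in data word."""
        if not rx_valid:
            return            # polling only: RX is not modified
        if self.rr:
            self.ov = True    # previous word not yet consumed: this word is lost
            return
        self.rx = data & 0xFFFFFFFF
        self.rr = True        # assumption: a successful write sets RR

    def handler_read(self) -> int:
        """Debug handler side: read RX (assumed to clear RR)."""
        self.rr = False
        return self.rx
```

In this model a second valid write before the handler has read the first one leaves RX unchanged and sets OV, which stays set even after the handler finally reads RX.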
Page 145: Trace Buffer
Intel XScale® Core Developer’s Manual Software Debug 9.12 Trace Buffer The 256 entry trace buffer provides the ability to capture control flow information to be used for debugging an application. Two modes are supported: 1. The buffer fills up completely and generates a debug exception. Then SW empties the buffer.
Page 146: Checkpoint Registers
Intel XScale® Core Developer’s Manual Software Debug 9.12.1.1 Checkpoint Registers When the debugger reconstructs a trace history, it is required to start at the oldest trace buffer entry and construct a trace going forward. In fill-once mode and wrap-around mode when the buffer does not wrap around, the trace can be reconstructed by starting from the point in the code where the trace buffer was first enabled.
Page 147: Trace Buffer Register (Tbreg)
9.12.1.2 Trace Buffer Register (TBREG) The trace buffer is read through TBREG, using MRC and MCR. Software should only read the trace buffer when it is disabled. Reading the trace buffer while it is enabled may cause unpredictable behavior of the trace buffer.
Page 148: Trace Buffer Entries
Intel XScale® Core Developer’s Manual Software Debug 9.13 Trace Buffer Entries Trace buffer entries consist of either one or five bytes. Most entries are one byte messages indicating the type of control flow change. The target address of the control flow change represented by the message byte is either encoded in the message byte (like for exceptions) or can be determined by looking at the instruction word (like for direct branches).
Page 149: Exception Message Byte
Intel XScale® Core Developer’s Manual Software Debug 9.13.1.1 Exception Message Byte When any kind of exception occurs, an exception message is placed in the trace buffer. In an exception message byte, the message type bit (M) is always 0. The vector exception (VVV) field is used to specify bits[4:2] of the vector address (offset from the base of default or relocated vector table).
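A trace decoder can recover the vector offset from the VVV field with one shift, because VVV holds bits[4:2] of the vector address. The sketch below assumes that M is bit 7 and VVV occupies bits 6:4 of the message byte; those positions are not stated in the text above and should be checked against the message byte format table. The function name is hypothetical.

```python
def decode_exception_byte(msg: int) -> dict:
    """Split an exception message byte into VVV and the implied vector offset."""
    m = (msg >> 7) & 0x1    # assumed position of the message type bit: bit 7
    vvv = (msg >> 4) & 0x7  # assumed position of the vector field: bits 6:4
    if m != 0:
        raise ValueError("M = 1: not an exception message byte")
    return {"vvv": vvv, "vector_offset": vvv << 2}
```

Under these assumptions a byte with VVV = 0b110 decodes to offset 0x18 from the vector table base, and VVV = 0b000 decodes to offset 0x00.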
Page 150: Non-Exception Message Byte
Intel XScale® Core Developer’s Manual Software Debug 9.13.1.2 Non-exception Message Byte Non-exception message bytes are used for direct branches, indirect branches, and rollovers. In a non-exception message byte, the 4-bit message type field (MMMM) specifies the type of message (refer to Table 9-18).
Page 151: Address Bytes
Intel XScale® Core Developer’s Manual Software Debug 9.13.1.3 Address Bytes Only indirect branch entries contain address bytes in addition to the message byte. Indirect branch entries always have four address bytes indicating the target of that indirect branch. When reading the trace buffer the MSB of the target address is read out first;...
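Because the most significant byte of the target address is read out first, a decoder can assemble the address as a big-endian value. The sketch below takes the four address bytes in the order they were read. The function name is hypothetical.

```python
def target_address(address_bytes: bytes) -> int:
    """Assemble the 32-bit branch target; the first byte read is the MSB."""
    if len(address_bytes) != 4:
        raise ValueError("an indirect branch entry carries exactly four address bytes")
    return int.from_bytes(address_bytes, byteorder="big")
```

Reading the bytes 0x12, 0x34, 0x56, 0x78 in that order gives the target address 0x12345678.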
Page 152: Trace Buffer Usage
Intel XScale® Core Developer’s Manual Software Debug 9.13.2 Trace Buffer Usage The Elkhart trace buffer is 256 bytes in length. The first byte read from the buffer represents the oldest trace history information in the buffer. The last (256th) byte read represents the most recent entry in the buffer.
Page 153 Intel XScale® Core Developer’s Manual Software Debug As the trace buffer is read, the oldest entries are read first. Reading a series of 5 (or more) consecutive “0b0000 0000” entries in the oldest entries indicates that the trace buffer has not wrapped around and the first valid entry will be the first non-zero entry read out.
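The rule above for locating the first valid entry can be sketched as a scan over the bytes in read order (oldest first). This is a hypothetical host-side helper. It interprets "in the oldest entries" as a run of zero bytes at the very start of the data read out, and it treats fewer than five leading zeros as ordinary trace data.

```python
from typing import Optional, Sequence


def first_valid_index(entries: Sequence[int]) -> Optional[int]:
    """Return the index of the first valid entry, or None if there is none.

    Five or more leading zero bytes mean the buffer has not wrapped, so the
    first non-zero byte is the first valid entry. Otherwise the oldest byte
    read (index 0) is assumed to be trace data already.
    """
    leading_zeros = 0
    for value in entries:
        if value != 0:
            break
        leading_zeros += 1
    if leading_zeros < 5:
        return 0 if entries else None
    if leading_zeros == len(entries):
        return None  # nothing but zeros: no trace was captured
    return leading_zeros
```

For a 256-byte read that starts with 250 zero bytes, decoding would begin at index 250. For a buffer that has wrapped, decoding begins at index 0.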
Page 154: Downloading Code In The Instruction Cache
9.14 Downloading Code in the Instruction Cache On Elkhart, a mini instruction cache, physically separate from the main instruction cache, can be used as an on-chip instruction RAM. A debugger can download code directly into either instruction cache through JTAG.
Page 155: Ldic Jtag Command
Intel XScale® Core Developer’s Manual Software Debug 9.14.2 LDIC JTAG Command The LDIC JTAG instruction selects the JTAG data register for loading code into the instruction cache. The JTAG opcode for this instruction is ‘00111’. The LDIC instruction must be in the JTAG instruction register in order to load code directly into the instruction cache through JTAG.
Page 156: Ldic Cache Functions
It does not require a virtual address or any data arguments. Load Main IC and Load Mini IC write one line of data (8 ARM instructions) into the specified instruction cache at the specified virtual address. Load Main IC has been deprecated on the Intel ®...
Page 157: Format Of Ldic Cache Functions
Intel XScale® Core Developer’s Manual Software Debug Figure 9-8. Format of LDIC Cache Functions VA[31:5] Invalidate IC Line . . . Invalidate Mini IC - indicates first bit shifted in Data Word 7 - indicates last bit shifted in Load Main IC...
Page 158: Loading Instruction Cache During Reset
Intel XScale® Core Developer’s Manual Software Debug 9.14.5 Loading Instruction Cache During Reset Code can be downloaded into the instruction cache through JTAG during a processor reset. This feature is used during software debug to download the debug handler prior to starting a debug session.
Page 159: Steps For Loading Mini Instruction Cache During Reset
Intel XScale® Core Developer’s Manual Software Debug Table 9-20 describes the actions a debugger should take to load code into the mini instruction cache during reset: Table 9-20. Steps For Loading Mini Instruction Cache During Reset Step # Action Notes...
Page 160: Dynamically Loading Instruction Cache After Reset
9.14.6 Dynamically Loading Instruction Cache After Reset A debugger can load code into the instruction cache “on the fly” or “dynamically”. This occurs when the debugger downloads code while the core is not held in reset and is useful for expanding the functionality of the debug handler.
Page 161: Steps For Dynamically Loading The Mini Instruction Cache
Intel XScale® Core Developer’s Manual Software Debug Table 9-21. Steps For Dynamically Loading the Mini Instruction Cache Action Step # Notes Debugger Debug Handler Debugger must poll DBGTX for an indication from the debug handler that it is safe to begin the download.
Page 162: Dynamic Download Synchronization Code
The Intel Debug Handler is a complete debug handler that implements the more commonly used functions, and allows less frequently used functions to be dynamically downloaded.
Page 163: Performance Considerations
Performance Considerations This chapter describes relevant performance considerations that compiler writers, application programmers and system designers need to be aware of to efficiently use the Intel XScale® core. Performance numbers discussed here include interrupt latency, branch prediction, and instruction latencies.
Page 164: Branch Prediction
10.2 Branch Prediction The Intel XScale® core implements dynamic branch prediction for the ARM* instructions B and BL and for the Thumb instruction B. Any instruction that specifies the PC as the destination is predicted as not taken.
Page 165: Instruction Latencies
Intel XScale® Core Developer’s Manual Performance Considerations 10.4 Instruction Latencies The latencies for all the instructions are shown in the following sections with respect to their functional groups: branch, data processing, multiply, status register access, load/store, semaphore, and coprocessor. The following section explains how to read these tables.
Page 166: Latency Example
Intel XScale® Core Developer’s Manual Performance Considerations • Minimum Resource Latency The minimum cycle distance from the issue clock of the current multiply instruction to the issue clock of the next multiply instruction assuming the second multiply does not incur a data dependency and is immediately available from the instruction cache or memory interface.
Page 167: Branch Instruction Timings
Intel XScale® Core Developer’s Manual Performance Considerations 10.4.2 Branch Instruction Timings Table 10-3. Branch Instruction Timings (Those predicted by the BTB) Minimum Issue Latency when Correctly Minimum Issue Latency with Branch Mnemonic Predicted by the BTB Misprediction Table 10-4. Branch Instruction Timings (Those not predicted by the BTB)
Page 168: Multiply Instruction Timings
Intel XScale® Core Developer’s Manual Performance Considerations 10.4.4 Multiply Instruction Timings Table 10-6. Multiply Instruction Timings (Sheet 1 of 2) Rs Value S-Bit Minimum Minimum Result Minimum Resource Mnemonic (Early Termination) Value Issue Latency Latency Latency (Throughput) Rs[31:15] = 0x00000...
Page 169: Multiply Implicit Accumulate Instruction Timings
Intel XScale® Core Developer’s Manual Performance Considerations Table 10-6. Multiply Instruction Timings (Sheet 2 of 2) Rs Value S-Bit Minimum Minimum Result Minimum Resource Mnemonic (Early Termination) Value Issue Latency Latency Latency (Throughput) RdLo = 2; RdHi = 3 Rs[31:15] = 0x00000 RdLo = 3;...
Page 170: Saturated Arithmetic Instructions
Intel XScale® Core Developer’s Manual Performance Considerations 10.4.5 Saturated Arithmetic Instructions Table 10-9. Saturated Data Processing Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency QADD QSUB QDADD QDSUB 10.4.6 Status Register Access Instructions Table 10-10. Status Register Access Instruction Timings...
Page 171: Load/Store Instructions
Intel XScale® Core Developer’s Manual Performance Considerations 10.4.7 Load/Store Instructions Table 10-11. Load and Store Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency 3 for load data; 1 for writeback of base LDRB 3 for load data; 1 for writeback of base LDRBT 3 for load data;...
Page 172: Coprocessor Instructions
Intel XScale® Core Developer’s Manual Performance Considerations 10.4.9 Coprocessor Instructions Table 10-14. CP15 Register Access Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency MRC to R15 is unpredictable Table 10-15. CP14 Register Access Instruction Timings Mnemonic Minimum Issue Latency...
Page 173: Thumb Instructions
10.4.11 Thumb Instructions In general, the timing of Thumb instructions is the same as that of their equivalent ARM instructions, except for the cases listed below. • If the equivalent ARM instruction maps to one in Table 10-3, the “Minimum Issue Latency...
Page 175: Optimization Guide
It can also be used by application developers to obtain the best performance from their assembly language code. The optimizations presented in this chapter are based on the Intel XScale® core, and hence can be applied to all products that are based on it.
Page 176: The Intel XScale® Core Pipeline
The Intel XScale® Core Pipeline One of the biggest differences between the Intel XScale® core and StrongARM processors is the pipeline. Many of the differences are summarized in Figure A-1. This section provides a brief description of the structure and behavior of the core pipeline.
Page 177: The Intel XScale® Core Pipeline Organization
A.2.1.2. The Intel XScale® Core Pipeline Organization The Intel XScale® core single-issue superpipeline consists of a main execution pipeline, MAC pipeline, and a memory access pipeline. These are shown in Figure A-1, with the main execution pipeline shaded.
Page 178: Out Of Order Completion
and store instructions. The Intel XScale® core preserves a weak processor consistency because instructions may complete out of order, provided that no data dependencies exist.
Page 179: Instruction Flow Through The Pipeline
A.2.2 Instruction Flow Through the Pipeline The Intel XScale® core pipeline issues a single instruction per clock cycle. Instruction execution begins at the F1 pipestage and completes at the WB pipestage. Although a single instruction may be issued per clock cycle, all three pipelines (MAC, memory, and main execution) may be processing instructions simultaneously.
Page 180: Main Execution Pipeline
Intel XScale® Core Developer’s Manual Optimization Guide A.2.3 Main Execution Pipeline A.2.3.1. F1 / F2 (Instruction Fetch) Pipestages The job of the instruction fetch stages F1 and F2 is to present the next instruction to be executed to the ID stage. Several important functional units reside within the F1 and F2 stages, including: •...
Page 181: Rf (Register File / Shifter) Pipestage
Intel XScale® Core Developer’s Manual Optimization Guide A.2.3.3. RF (Register File / Shifter) Pipestage The main function of the RF pipestage is to read and write to the register file unit, or RFU. It provides source data to: • EX for ALU operations •...
Page 182: Memory Pipeline
Intel XScale® Core Developer’s Manual Optimization Guide A.2.4 Memory Pipeline The memory pipeline consists of two stages, D1 and D2. The data cache unit, or DCU, consists of the data-cache array, mini-data cache, fill buffers, and writebuffers. The memory pipeline handles load / store instructions.
Page 183: Basic Optimizations
Basic Optimizations This chapter outlines optimizations specific to ARM architecture. These optimizations have been modified to suit the core where needed. A.3.1 Conditional Instructions The Intel XScale® core architecture provides the ability to execute instructions conditionally. This feature combined with the ability of the core instructions to modify the condition codes makes possible a wide array of optimizations.
Page 184: Optimizing Branches
#0 r0, #1 The code generated above takes three cycles to execute the else-part and four cycles for the if-part, assuming best case conditions and no branch misprediction penalties. In the case of the Intel XScale® core, a branch misprediction incurs a penalty of four cycles. If the branch is mispredicted 50% of the time, and if we consider that both the if-part and the else-part are equally likely to be taken, on average the code above takes 5.5 cycles to execute.
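The 5.5-cycle figure follows from weighting the two paths equally and adding the expected misprediction penalty: 0.5 × 4 + 0.5 × 3 + 0.5 × 4 = 5.5. The small helper below reproduces that arithmetic so other probabilities can be tried. The function and parameter names are hypothetical.

```python
def expected_cycles(if_cycles: float, else_cycles: float,
                    p_if: float, p_mispredict: float, penalty: float) -> float:
    """Average cost = weighted path cost + expected misprediction penalty."""
    path_cost = p_if * if_cycles + (1.0 - p_if) * else_cycles
    return path_cost + p_mispredict * penalty
```

Calling expected_cycles(4, 3, 0.5, 0.5, 4) gives the manual's 5.5 cycles. With a perfectly predicted branch (p_mispredict = 0) the same code would average 3.5 cycles.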
Page 185 Intel XScale® Core Developer’s Manual Optimization Guide Consider that we have the following data: Number of cycles to execute the if_stmt assuming the use of branch instructions Number of cycles to execute the else_stmt assuming the use of branch instructions...
Page 186: Optimizing Complex Expressions
A.3.1.3. Optimizing Complex Expressions
Conditional instructions should also be used to improve the code generated for complex expressions such as the C short-circuit evaluation feature. Consider the following C code segment:
int foo(int a, int b) { if (a != 0 &&...
Page 187: Bit Field Manipulation
A.3.2 Bit Field Manipulation
The Intel XScale® core shift and logical operations provide a useful way of manipulating bit fields. Bit field operations can be optimized as follows:
;Set the bit number specified by r1 in register r0...
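In C, the same shift-and-logical idiom looks as follows. This is a minimal sketch rather than the manual's own listing: only the set-bit operation is named in the text above, and the clear and extract helpers are assumed companions added for illustration. A compiler targeting the core can typically map each of these to one or two shift/logical instructions.

```c
#include <stdint.h>

/* Set bit number `bit` (0-31) in `value`. */
uint32_t set_bit(uint32_t value, unsigned bit)
{
    return value | (UINT32_C(1) << bit);
}

/* Clear bit number `bit` (0-31) in `value`. */
uint32_t clear_bit(uint32_t value, unsigned bit)
{
    return value & ~(UINT32_C(1) << bit);
}

/* Extract `width` bits starting at bit `lsb`.
 * Requires 1 <= width <= 32 and lsb + width <= 32. */
uint32_t extract_field(uint32_t value, unsigned lsb, unsigned width)
{
    /* A shift by 32 is undefined in C, so the full-width mask is special-cased. */
    uint32_t mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
    return (value >> lsb) & mask;
}
```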
Page 188: Optimizing The Use Of Immediate Values
A.3.3 Optimizing the Use of Immediate Values
The Intel XScale® core MOV or MVN instruction should be used when loading an immediate (constant) value into a register. Please refer to the ARM Architecture Reference Manual for the set of immediate values that can be used in a MOV or MVN instruction.
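As a rough aid for deciding whether a constant qualifies, the sketch below checks the standard ARM data-processing immediate form: an 8-bit value rotated right by an even number of bits. MVN covers constants whose bitwise complement has that form. The helper names are hypothetical; the ARM Architecture Reference Manual remains the authority on the encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by r bits (0 <= r < 32). */
uint32_t rotl32(uint32_t x, unsigned r)
{
    return (r == 0) ? x : (x << r) | (x >> (32 - r));
}

/* True if c can be written as an 8-bit value rotated right by an even
 * amount (0, 2, ..., 30). Rotating left by the same amount undoes the
 * rotation, so we look for a rotation that leaves only the low 8 bits set. */
bool is_arm_immediate(uint32_t c)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(c, rot) <= 0xFFu)
            return true;
    }
    return false;
}

/* True if c can be loaded with a single MOV (c encodable) or MVN (~c encodable). */
bool fits_mov_or_mvn(uint32_t c)
{
    return is_arm_immediate(c) || is_arm_immediate(~c);
}
```

Constants that fail this check need another technique, such as building the value in several instructions or loading it from memory.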
Page 189: Optimizing Integer Multiply And Divide
A.3.4 Optimizing Integer Multiply and Divide
Multiplication by an integer constant should be optimized to make use of the shift operation whenever possible.
;Multiplication of R0 by 2^n
mov r0, r0, LSL #n
;Multiplication of R0 by 2^n + 1
add r0, r0, r0, LSL #n
·...
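The shift-based forms can be written in C as shown below. This sketch assumes the two cases being illustrated are multiplication by 2^n (a single shift) and by 2^n + 1 (a shift folded into an add); the function names are illustrative. Unsigned types are used so that the shifts are well defined and wrap modulo 2^32.

```c
#include <stdint.h>

/* x * 2^n using a single left shift (0 <= n < 32). */
uint32_t mul_pow2(uint32_t x, unsigned n)
{
    return x << n;
}

/* x * (2^n + 1) as x + (x << n): one add with a shifted operand (0 <= n < 32). */
uint32_t mul_pow2_plus1(uint32_t x, unsigned n)
{
    return x + (x << n);
}
```

For example, `mul_pow2_plus1(x, 3)` multiplies by 9 without using a multiply instruction.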
Page 190: Effective Use Of Addressing Modes
A.3.5 Effective Use of Addressing Modes
The Intel XScale® core provides a variety of addressing modes that make indexing an array of objects highly efficient. For a detailed description of these addressing modes please refer to the ARM Architecture Reference Manual.
Page 191: A.4.1 Instruction Cache
Cache and Prefetch Optimizations
This section considers how to use the various cache memories in all their modes and then examines when and how to use prefetch to improve execution efficiencies.
A.4.1 Instruction Cache...
Page 192: Locking Code Into The Instruction Cache
A.4.1.4. Locking Code into the Instruction Cache
One very important instruction cache feature is the ability to lock code into the instruction cache. Once locked into the instruction cache, the code is always available for fast execution. Another reason for locking critical code into cache is that with the round robin replacement policy, eventually the code will be evicted, even if it is a very frequently executed function.
Page 193: Data And Mini Cache
A.4.2 Data and Mini Cache
The Intel XScale® core allows the user to define memory regions whose cache policies can be set by the user (see Section 6.2.3, “Cache Policies”). Supported policies and configurations are: •...
Page 194: Read Allocate And Read-Write Allocate Memory Regions
A.4.2.3. Read Allocate and Read-write Allocate Memory Regions
Most of the regular data and the stack for your application should be allocated to a read-write allocate region. It is expected that you will be writing and reading from them often.
Page 195: Mini-Data Cache
A.4.2.5. Mini-data Cache
The mini-data cache is best used for data structures that have short temporal lives and/or cover vast amounts of data space. Addressing these types of data spaces from the data cache would corrupt much, if not all, of the data cache by evicting valuable data.
Page 196: Data Alignment
A.4.2.6. Data Alignment
Cache lines begin on 32-byte address boundaries. To maximize cache line use and minimize cache pollution, data structures should be aligned on 32-byte boundaries and sized to a multiple of the cache line size.
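One way to express this in portable C11 is with `alignas`. The sketch below is illustrative only: the structure, its fields, and the `CACHE_LINE_BYTES` macro are hypothetical, and toolchains that predate C11 would need a compiler-specific alignment attribute instead.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32

/* Hypothetical record: aligning the first member to the cache line size
 * raises the alignment of the whole structure, and the compiler then pads
 * sizeof(struct sample_record) up to a multiple of that alignment. */
struct sample_record {
    alignas(CACHE_LINE_BYTES) uint32_t id;
    uint32_t flags;
    uint16_t counts[8];
};

/* Compile-time check that each array element starts on its own cache line. */
_Static_assert(sizeof(struct sample_record) % CACHE_LINE_BYTES == 0,
               "record size must be a multiple of the cache line size");
```

Here the 24 bytes of payload are padded to 32, so an array of these records never has one element straddling two cache lines.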
Page 197: Literal Pools
A.4.2.7. Literal Pools
The Intel XScale® core does not have a single instruction that can move all literals (a constant or address) to a register. One technique to load registers with literals in the core is to load the literal from a memory location that has been initialized with the constant or address.
Page 198: Cache Considerations
A.4.3 Cache Considerations
A.4.3.1. Cache Conflicts, Pollution and Pressure
Cache pollution occurs when unused data is loaded in the cache, and cache pressure occurs when data that is not temporal to the current process is loaded into the cache. For an example, see Section A.4.4.2., “Prefetch Loop Scheduling”...
Page 199: Prefetch Considerations
Prefetch Distances
Scheduling the prefetch instruction requires understanding the system latency times and system resources which affect when to use the prefetch instruction. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for more information.
Page 200: Low Number Of Iterations
A.4.4.5. Low Number of Iterations
Loops with very low iteration counts may lose the advantages of prefetch entirely. A loop with a small fixed number of iterations may be faster if the loop is completely unrolled rather than trying to schedule prefetch instructions.
Page 201: Cache Memory Considerations
A.4.4.7. Cache Memory Considerations
Stride, the way data structures are walked through, can affect the temporal quality of the data and reduce or increase cache conflicts. The data cache and mini-data caches each have 32 sets of 32 bytes.
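To see why stride matters, it helps to compute which set each access lands in. The sketch below assumes the conventional mapping for 32 sets of 32-byte lines, where the set index is taken from address bits [9:5]; that mapping and all names here are assumptions made for illustration.

```c
#include <stdint.h>

enum { LINE_BYTES = 32, NUM_SETS = 32 };

/* Set index under the assumed mapping: drop the 5 offset bits,
 * then keep the next 5 bits (address bits [9:5]). */
unsigned cache_set_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Count how many of `count` accesses, starting at `base` and advancing by
 * `stride_bytes`, fall into the same set as the first access. */
unsigned accesses_in_first_set(uint32_t base, uint32_t stride_bytes,
                               unsigned count)
{
    unsigned first = cache_set_index(base);
    unsigned hits = 0;

    for (unsigned i = 0; i < count; i++) {
        if (cache_set_index(base + i * stride_bytes) == first)
            hits++;
    }
    return hits;
}
```

Under this model a 1024-byte stride (32 sets × 32 bytes) sends every access to the same set and maximizes conflicts, while a 32-byte stride spreads consecutive accesses across different sets.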
Page 202 In the data structure shown above, the fields Year2DatePay, Year2DateTax, Year2Date401KDed, and Year2DateOtherDed are likely to change with each pay check. The remaining fields, however, change very rarely. If the fields are laid out as shown above, assuming that the structure is aligned on a 32-byte boundary, modifications to the Year2Date fields are likely to use two write buffers when the data is written out to memory.
Page 203: Cache Blocking
A.4.4.8. Cache Blocking
Cache blocking techniques, such as strip-mining, are used to improve temporal locality of the data. Given a large data set that can be reused across multiple passes of a loop, data blocking divides the...
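A minimal strip-mining sketch is shown below. Instead of making every pass over the whole array, it makes all passes over one block before moving on, so each block is reused while it is still likely to be cached. The block size, the per-element work, and the function name are assumptions chosen for illustration; a real block size would be tuned to the cache capacity.

```c
#include <stddef.h>

/* Illustrative block size in elements; 1024 ints is 4 KB on a 32-bit target. */
#define BLOCK_ELEMS ((size_t)1024)

/* Apply `passes` passes, each adding 1 to every element, one block at a time. */
void add_passes_blocked(int *data, size_t n, unsigned passes)
{
    for (size_t start = 0; start < n; start += BLOCK_ELEMS) {
        /* Clamp the final block to the end of the array. */
        size_t end = (n - start < BLOCK_ELEMS) ? n : start + BLOCK_ELEMS;

        /* All passes touch this block before the next block is loaded. */
        for (unsigned p = 0; p < passes; p++) {
            for (size_t i = start; i < end; i++)
                data[i] += 1;
        }
    }
}
```

The result is identical to running each pass over the full array; only the order of memory accesses changes.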
Page 204: Pointer Prefetch
A.4.4.10. Pointer Prefetch
Not all looping constructs contain induction variables. However, prefetching techniques can still be applied. Consider the following linked list traversal example:
while(p) { do_something(p->data); p = p->next; }
The pointer variable p becomes a pseudo induction variable and the data pointed to by p->next can be prefetched to reduce data transfer latency for the next iteration of the loop.
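The traversal with a prefetch of the next node might be sketched as follows. This assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` emits a prefetch hint where the target supports one; on other compilers the `PREFETCH` macro below falls back to a no-op. The `node` layout and the summing loop body are stand-ins for the `do_something(p->data)` call in the text.

```c
#include <stddef.h>

struct node {
    int data;
    struct node *next;
};

/* Prefetch hint: a compiler builtin where available, otherwise a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

/* Walk the list, requesting the next node while the current one is processed. */
long sum_list_with_prefetch(const struct node *p)
{
    long total = 0;

    while (p) {
        PREFETCH(p->next);  /* a hint only; a null next pointer is harmless */
        total += p->data;   /* stands in for do_something(p->data) */
        p = p->next;
    }
    return total;
}
```

The benefit depends on the loop body doing enough work to overlap with the memory access for the next node.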
Page 205: Loop Interchange
A.4.4.11. Loop Interchange
As mentioned earlier, the sequence in which data is accessed affects cache thrashing. Usually, it is best to access data in a spatially contiguous address range. However, arrays of data may have been laid out such that indexed elements are not physically next to each other.
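For a row-major C array, interchanging the loops so that the inner loop walks along a row turns a large-stride access pattern into a contiguous one. The two functions below are an illustrative sketch with hypothetical names; they compute the same sum and differ only in traversal order.

```c
#include <stddef.h>

/* Column-first traversal: the inner loop jumps `cols` elements per step,
 * so consecutive accesses are far apart in memory. */
long sum_column_first(const int *a, size_t rows, size_t cols)
{
    long total = 0;

    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            total += a[i * cols + j];
    return total;
}

/* After loop interchange: the inner loop touches adjacent elements,
 * so each fetched cache line is fully used before moving on. */
long sum_row_first(const int *a, size_t rows, size_t cols)
{
    long total = 0;

    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += a[i * cols + j];
    return total;
}
```

Interchange is only valid when the loop body has no dependence that relies on the original iteration order, as is the case for this sum.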
Page 206: Prefetch To Reduce Register Pressure
A.4.4.13. Prefetch to Reduce Register Pressure
Prefetch can be used to reduce register pressure. When data is needed for an operation, then the load is scheduled far enough in advance to hide the load latency. However, the load ties up the receiving register until the data can be used.
Page 207: Instruction Scheduling
Instruction Scheduling
This chapter discusses instruction scheduling optimizations. Instruction scheduling refers to the rearrangement of a sequence of instructions for the purpose of minimizing pipeline stalls. Reducing the number of pipeline stalls improves application performance. While making this rearrangement, care should be taken to ensure that the rearranged sequence of instructions has the same effect as the original sequence of instructions.
Page 208 The result latency for an LDR instruction is significantly higher if the data being loaded is not in the data cache. To minimize the number of pipeline stalls in such a situation, the LDR instruction should be moved as far away as possible from the instruction that uses the result of the load.
Page 209 The Intel XScale® core has 4 fill buffers that are used to fetch data from external memory when a data-cache miss occurs. The core stalls when all fill buffers are in use. This happens when more than 4 loads are outstanding and are being fetched from memory.
Page 210: Scheduling Load And Store Double (Ldrd/Strd)
A.5.1.1. Scheduling Load and Store Double (LDRD/STRD)
The Intel XScale® core introduces two new double word instructions: LDRD and STRD. LDRD loads 64 bits of data from an effective address into two consecutive registers; conversely, STRD stores 64 bits from two consecutive registers to an effective address.
Page 211: Scheduling Load And Store Multiple (Ldm/Stm)
A.5.1.2. Scheduling Load and Store Multiple (LDM/STM)
LDM and STM instructions have an issue latency of 2-20 cycles depending on the number of registers being loaded or stored. The issue latency is typically 2 cycles plus an additional cycle for each of the registers being loaded or stored assuming a data cache hit.
Page 212: Scheduling Data Processing Instructions
A.5.2 Scheduling Data Processing Instructions
Most core data processing instructions have a result latency of 1 cycle. This means that the current instruction is able to use the result from the previous data processing instruction. However, the result latency is 2 cycles if the current instruction needs to use the result of the previous data processing instruction for a shift by immediate.
Page 213: Scheduling Multiply Instructions
A.5.3 Scheduling Multiply Instructions
Multiply instructions can cause pipeline stalls due to either resource conflicts or result latencies. The following code segment would incur a stall of 0-3 cycles depending on the values in registers r1, r2, r4 and r5 due to resource conflicts.
Page 214: Scheduling Swp And Swpb Instructions
A.5.4 Scheduling SWP and SWPB Instructions
The SWP and SWPB instructions have a 5 cycle issue latency. As a result of this latency, the instruction following the SWP/SWPB instruction would stall for 4 cycles. SWP and SWPB instructions should, therefore, be used only where absolutely needed.
Page 215: Scheduling The Mra And Mar Instructions (Mrrc/Mcrr)
A.5.5 Scheduling the MRA and MAR Instructions (MRRC/MCRR)
The MRA (MRRC) instruction has an issue latency of 1 cycle, a result latency of 2 or 3 cycles depending on the destination register value being accessed, and a resource latency of 2 cycles.
Page 216: Scheduling The Mia And Miaph Instructions
A.5.6 Scheduling the MIA and MIAPH Instructions
The MIA instruction has an issue latency of 1 cycle. The result and resource latency can vary from 1 to 3 cycles depending on the values in the source register.
Page 217: Scheduling Mrs And Msr Instructions
A.5.7 Scheduling MRS and MSR Instructions
The MRS instruction has an issue latency of 1 cycle and a result latency of 2 cycles. The MSR instruction has an issue latency of 2 cycles (6 if updating the mode bits) and a result latency of 1 cycle.
Page 218: Optimizing C Libraries
Optimizing C Libraries
Many of the standard C library routines can benefit greatly by being optimized for the core architecture. The following string and memory manipulation routines should be tuned to obtain the...
Page 219: Test Features
Test Features
This chapter gives a brief overview of the Intel XScale® core JTAG features. The Intel XScale® core provides a baseline set of features that the ASSP builds upon. A full description of these features can be found in the ASSP architecture specification.

