TokenExpress, Trillium, Vivonic, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. The ARM* and ARM Powered logo marks (the ARM marks) are trademarks of ARM, Ltd., and Intel uses these marks under license from ARM, Ltd. *Other names and brands may be claimed as the property of others.
Introduction
About This Document
1.1.1 How to Read This Document
1.1.2 Other Relevant Documents
High-Level Overview of the Intel XScale® Core
1.2.1 ARM Compatibility
1.2.2 Features
1.2.2.1 Multiply/Accumulate (MAC)
1.2.2.2...
Intel XScale® Core Developer’s Manual
Contents
3.2.2.2 Cacheable (C), Bufferable (B), and eXtension (X) Bits
3.2.2.3 Instruction Cache
3.2.2.4 Data Cache and Write Buffer
3.2.2.5 Details on Data Cache and Write Buffer Behavior
3.2.2.6 Memory Operation Ordering
6.2.4 Round-Robin Replacement Algorithm
6.2.5 Parity Protection
6.2.6 Atomic Accesses
Data Cache and Mini-Data Cache Control
6.3.1 Data Memory State After Reset
6.3.2 Enabling/Disabling
6.3.3...
Intel retains the right to make changes to these specifications at any time, without notice. In particular, descriptions of features, timings, and pin-outs do not imply a commitment to implement them.
This document describes Version 5TE of the ARM Architecture which includes Thumb ISA and ARM DSP-Enhanced ISA. (ISBN 0 201 737191)
• StrongARM SA-1100 Microprocessor Developer’s Manual, Intel Order # 278105
• StrongARM SA-110 Microprocessor Technical Reference Manual, Intel Order # 278104
January, 2004 Developer’s Manual...
1.2.1 ARM Compatibility
ARM Version 5 (V5) Architecture added floating point instructions to ARM Version 4. The Intel XScale® core implements the integer instruction set architecture of ARM V5, but does not provide hardware support of the floating point instructions.
1.2.2 Features
Figure 1-1 shows the major functional blocks of the Intel XScale® core. The following sections give a brief, high-level overview of these blocks.
Figure 1-1. Architecture Features (block diagram not reproduced; its labels include Data Cache and Instruction Cache...)
1.2.2.2 Memory Management
The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. The MMU provides access protection and virtual to physical address translation. The MMU Architecture also specifies the caching policies for the instruction cache and data memory.
1.2.2.6 Performance Monitoring
The Intel XScale® core includes performance monitoring counters that can be configured to monitor various events in the core. These events allow a software developer to measure cache efficiency, detect system bottlenecks and reduce the overall latency of programs.
Once an entry is flushed in the cache it can no longer be used by the program.
XSC1
XSC1 refers to a variant of the Intel XScale® core denoted by a CoreGen (Coprocessor 15, ID Register) value of 0x1. This variant has a 2-counter performance monitor and a 5-bit JTAG instruction register. See Table 7-4, “ID Register”...
Programming Model
This chapter describes the programming model of the Intel XScale® core, namely the implementation options and extensions to the ARM Version 5TE architecture.
ARM Architecture Compatibility
The Intel XScale® core implements the integer instruction set architecture specified in ARM V5TE.
Section 2.3.1.2 for more information. Access to coprocessors 15 and 14 generates an undefined instruction exception. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for the behavior when accessing all other coprocessors.
2.2.5...
CP0. If this is the case, a complete definition can be found in the Intel XScale® core implementation option section of the ASSP architecture specification. For this very reason, software should not rely on behavior that is specific to the 40-bit length of the accumulator, since the length may be extended.
Rm - Multiplicand
Two new fields were created for this format, acc and opcode_3. The acc field specifies 1 of 8 internal accumulators to operate on and opcode_3 defines the operation for this format. The Intel XScale® core defines a single 40-bit accumulator referred to as acc0; future implementations may...
2.3.1.2 Internal Accumulator Access Format
The Intel XScale® core defines a new instruction format for accessing internal accumulators in CP0. Table 2-5, “Internal Accumulator Access Format” on page 2-27 shows that the opcode falls into the coprocessor register transfer space.
P bit in the first level descriptors to allow an ASSP to identify a new memory attribute. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to find out how the P bit has been defined. Bit 1 in the Control Register (coprocessor 15, register 1, opcode=1) is used to assign the P bit memory attribute for memory accesses made during page table walks.
2.3.3 Additions to CP15 Functionality
To accommodate the functionality in the Intel XScale® core, registers in CP15 and CP14 have been added or augmented. See Chapter 7, “Configuration” for details.
At times it is necessary to guarantee exactly when a CP15 update takes effect. For example, when enabling memory address translation (turning on the MMU), it is vital to know when the MMU is actually guaranteed to be in operation.
2.3.4 Event Architecture
2.3.4.1 Exception Summary
Table 2-11 shows all the exceptions that the core may generate, and the attributes of each. Subsequent sections give details on each exception.
Table 2-11. Exception Summary...
2.3.4.3 Prefetch Aborts
The Intel XScale® core detects three types of prefetch aborts: Instruction MMU abort, external abort on an instruction access, and an instruction cache parity error. These aborts are described in Table 2-13.
2.3.4.4 Data Aborts
Two types of data aborts exist in the Intel XScale® core: precise and imprecise. A precise data abort is defined as one where R14_ABORT always contains the PC (+8) of the instruction that caused the exception. An imprecise abort is one where R14_ABORT contains the PC (+4) of the next instruction to execute and not the address of the instruction that caused the abort.
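The address arithmetic implied by these definitions can be sketched as follows. This is a minimal illustration in Python, not handler code; the function names are hypothetical, and 32-bit wraparound is assumed.

```python
MASK32 = 0xFFFFFFFF  # addresses are treated as 32-bit values (assumption)

def precise_abort_fault_address(r14_abort: int) -> int:
    """Precise abort: R14_ABORT holds the faulting instruction's address + 8."""
    return (r14_abort - 8) & MASK32

def imprecise_abort_next_instruction(r14_abort: int) -> int:
    """Imprecise abort: R14_ABORT holds the next instruction's address + 4.

    The address of the instruction that caused the abort is not recoverable
    from R14_ABORT in this case.
    """
    return (r14_abort - 4) & MASK32
```

For example, a precise abort with R14_ABORT = 0x00008010 points back to the instruction at 0x00008008, while the same value after an imprecise abort only identifies 0x0000800C as the next instruction to execute.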
Although the core guarantees the Base Restored Abort Model for precise aborts, it cannot do so in the case of imprecise aborts. A Data Abort handler may encounter an updated base register if it is invoked because of an imprecise abort.
This feature allows software to issue PLDs speculatively. For example, Example 2-3 on page 2-36 places a PLD instruction early in the loop. This PLD is used to fetch data for the next loop iteration.
Memory Management
This chapter describes the memory management unit implemented in the Intel XScale® core.
Overview
The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. To accelerate virtual to physical address translation, the core uses both an instruction Translation Look-aside Buffer (TLB) and a data TLB to cache the latest translations.
The P bit allows an ASSP to assign its own page attribute to a memory region. This bit is only present in the first level descriptors. Refer to the Intel XScale® core implementation section of the ASSP architecture specification to find out how this has been defined. Accesses to memory for page table walks do not use the MMU.
3.2.2.4 Data Cache and Write Buffer
All of these descriptor bits affect the behavior of the Data Cache and the Write Buffer. If the X bit for a descriptor is zero, the C and B bits operate as mandated by the ARM architecture.
3.2.2.5 Details on Data Cache and Write Buffer Behavior
If the MMU is disabled, all data accesses will be non-cacheable and non-bufferable. This is the same behavior as when the MMU is enabled and a data access uses a descriptor with X, C, and B all set to 0.
Interaction of the MMU, Instruction Cache, and Data Cache
The MMU, instruction cache, and data/mini-data cache may be enabled/disabled independently. The instruction cache can be enabled with the MMU enabled or disabled. However, the data cache can only be enabled when the MMU is enabled.
Control
3.4.1 Invalidate (Flush) Operation
The entire instruction and data TLB can be invalidated at the same time with one command or they can be invalidated separately. An individual entry in the data or instruction TLB can also be invalidated.
3.4.3 Locking Entries
Individual entries can be locked into the instruction and data TLBs. See Table 7-14, “Cache Lockdown Functions” on page 7-90 for the exact commands. If a lock operation finds the virtual address translation already resident in the TLB, the results are unpredictable.
The proper procedure for locking entries into the data TLB is shown in Example 3-3 on page 3-44.
Example 3-3. Locking Entries into the Data TLB
; R1, and R2 contain the virtual addresses to translate and lock into the data TLB
MCR P15,0,R1,C8,C6,1 ;...
3.4.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the TLBs is round-robin; there is a round-robin pointer that keeps track of the next entry to replace. The next entry to replace is the one sequentially after the last entry that was written.
Instruction Cache
The Intel XScale® core instruction cache enhances performance by reducing the number of instruction fetches from external memory. The cache provides fast execution of cached code. Code can also be locked down when guaranteed or fast access time is required.
Operation
4.2.1 Operation When Instruction Cache is Enabled
When the cache is enabled, it compares every instruction request address against the addresses of instructions that it is currently holding. If the cache contains the requested instruction, the access “hits”...
4.2.3 Fetch Policy
An instruction-cache “miss” occurs when the requested instruction is not found in the instruction fetch buffers or instruction cache; a fetch request is then made to external memory. The instruction cache can handle up to two “misses.”...
4.2.5 Parity Protection
The instruction cache is protected by parity to ensure data integrity. Each instruction cache word has 1 parity bit. (The instruction cache tag is NOT parity protected.) When a parity error is detected on an instruction cache access, a prefetch abort exception occurs if the core attempts to execute the instruction.
4.2.6 Instruction Fetch Latency
The instruction fetch latency depends on the core to memory frequency ratio, system bus bandwidth, system memory, etc., which are all particular to each ASSP. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for exact details on instruction fetch latency.
Instruction Cache Control
4.3.1 Instruction Cache State at RESET
After reset, the instruction cache is always disabled, unlocked, and invalidated (flushed).
4.3.2 Enabling/Disabling
The instruction cache is enabled by setting bit 12 in coprocessor 15, register 1 (Control Register).
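The read-modify-write of the Control Register value can be sketched as below. This is a Python model of the bit manipulation only; on the core the value would be read and written with MRC/MCR instructions. Bit 12 comes from the text above. The MMU enable (bit 0) and data cache enable (bit 2) positions are taken from the ARM control register layout and are assumptions here, as are the function names.

```python
ICACHE_ENABLE = 1 << 12  # from the text: bit 12 of CP15 register 1
MMU_ENABLE = 1 << 0      # assumption: M bit of the ARM control register
DCACHE_ENABLE = 1 << 2   # assumption: C bit of the ARM control register

def enable_icache(control: int) -> int:
    """Return the control value with the instruction cache enabled."""
    return control | ICACHE_ENABLE

def disable_icache(control: int) -> int:
    """Return the control value with the instruction cache disabled."""
    return control & ~ICACHE_ENABLE & 0xFFFFFFFF

def is_supported_config(control: int) -> bool:
    """The data cache may only be enabled when the MMU is enabled;
    the instruction cache has no such restriction."""
    dcache_on = bool(control & DCACHE_ENABLE)
    mmu_on = bool(control & MMU_ENABLE)
    return mmu_on or not dcache_on
```

Keeping the helpers pure (value in, value out) makes the bit logic easy to check separately from the coprocessor access itself.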
4.3.3 Invalidating the Instruction Cache
The entire instruction cache along with the fetch buffers is invalidated by writing to coprocessor 15, register 7. (See Table 7-12, “Cache Functions” on page 7-87 for the exact command.) This command does not unlock any lines that were locked in the instruction cache nor...
4.3.4 Locking Instructions in the Instruction Cache
Software has the ability to lock performance critical routines into the instruction cache. Up to 28 lines in each set can be locked; hardware will ignore the lock command if software is trying to lock all the lines in a particular set (i.e., ways 28-31 can never be locked).
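The per-set locking limit can be modeled with a small sketch. It assumes 32 ways per set (implied by the reference to ways 28-31) and assumes locked lines occupy ways starting from way 0; the class and method names are hypothetical.

```python
WAYS_PER_SET = 32        # implied by "ways 28-31" in the text
MAX_LOCKED_PER_SET = 28  # from the text

class CacheSetLocks:
    """Tracks how many lines of one instruction cache set are locked."""

    def __init__(self) -> None:
        self.locked = 0

    def lock_line(self) -> bool:
        """Lock one more line; returns False when the command is ignored."""
        if self.locked >= MAX_LOCKED_PER_SET:
            return False  # hardware ignores the lock command
        self.locked += 1
        return True

    def unlocked_ways(self) -> list:
        """Ways still available for normal replacement (assumed ordering)."""
        return list(range(self.locked, WAYS_PER_SET))
```

Under these assumptions at least four ways per set always remain available for ordinary cache fills, no matter how many lock commands software issues.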
Software can lock down several different routines located at different memory locations. This may cause some sets to have more locked lines than others as shown in Figure 4-2. Example 4-4 on page 4-55 shows how a routine, called “lockMe”...
Branch Target Buffer
The Intel XScale® core uses dynamic branch prediction to reduce the penalties associated with changing the flow of program execution. The core features a branch target buffer that provides the instruction cache with the target address of branch type instructions.
The history bits represent four possible prediction states for a branch entry in the BTB. Figure 5-2, “Branch History” on page 5-58 shows these states along with the possible transitions. The initial state for branches stored in the BTB is Weakly-Taken (WT).
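A four-state history of this kind is commonly implemented as a two-bit saturating counter. The sketch below assumes that scheme, with states Strongly-Not-Taken (SN), Weakly-Not-Taken (WN), Weakly-Taken (WT) and Strongly-Taken (ST); the exact transitions are defined by Figure 5-2, which is not reproduced here, so the transition function is an assumption. Only the initial WT state is taken from the text.

```python
STATES = ["SN", "WN", "WT", "ST"]  # ordered from strongly not-taken to strongly taken
INITIAL_STATE = "WT"               # from the text

def next_state(state: str, taken: bool) -> str:
    """Move one step toward ST on a taken branch, toward SN otherwise (assumed)."""
    i = STATES.index(state)
    i = min(i + 1, len(STATES) - 1) if taken else max(i - 1, 0)
    return STATES[i]

def predict_taken(state: str) -> bool:
    """Predict taken in either of the two 'taken' states."""
    return state in ("WT", "ST")
```

With this model a newly stored branch is predicted taken, and a single not-taken outcome is enough to flip the prediction, while a branch in ST needs two consecutive not-taken outcomes.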
BTB Control
5.2.1 Disabling/Enabling
The BTB is always disabled with Reset. Software can enable the BTB through a bit in a coprocessor register (see Section 7.2.2). Before enabling or disabling the BTB, software must invalidate it (described in the following section).
Data Cache
The Intel XScale® core data cache enhances performance by reducing the number of data accesses to and from external memory. There are two data cache structures in the core, a data cache with two...
Figure 6-1. Data Cache Organization (figure not reproduced; the example shown is a 32 Kbyte cache with sets 0 through 31 selected by the Set Index, each way holding a 32-byte cache line...)
6.1.2 Mini-Data Cache Overview
The mini-data cache is 1/16 the size of the data cache, so depending on the data cache size selected the available sizes are 2 K or 1 Kbytes. The 2 Kbyte version has 32 sets and the 1 Kbyte version has 16 sets;...
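The sizes quoted above determine the cache geometry by simple arithmetic, sketched below. The 32-byte line size comes from Figure 6-1; the set-index calculation assumes the index is taken from the address bits immediately above the line offset, which this excerpt does not state. Function names are hypothetical.

```python
LINE_BYTES = 32  # cache line size shown in Figure 6-1

def ways(cache_bytes: int, sets: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of ways implied by total size, set count and line size."""
    return cache_bytes // (sets * line_bytes)

def mini_cache_bytes(data_cache_bytes: int) -> int:
    """The mini-data cache is 1/16 the size of the data cache."""
    return data_cache_bytes // 16

def set_index(address: int, sets: int, line_bytes: int = LINE_BYTES) -> int:
    """Assumed indexing: address bits just above the line offset select the set."""
    return (address // line_bytes) % sets
```

Both mini-data cache sizes work out to two ways (2048 / (32 x 32) and 1024 / (16 x 32)), while the 32 Kbyte data cache with 32 sets works out to 32 ways.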
6.1.3 Write Buffer and Fill Buffer Overview
The Intel XScale® core employs an eight-entry write buffer, each entry containing 16 bytes. Stores to external memory are first placed in the write buffer and subsequently taken out when the bus is available.
Data Cache and Mini-Data Cache Operation
The following discussions refer to the data cache and mini-data cache as one cache (data/mini-data) since their behavior is the same when accessed.
6.2.1 Operation When Caching is Enabled
When the data/mini-data cache is enabled for an access, the data/mini-data cache compares the address of the request against the addresses of data that it is currently holding.
6.2.3.2 Read Miss Policy
The following sequence of events occurs when a cacheable (see Section 6.2.3.1, “Cacheability” on page 6-65) load operation misses the cache:
1. The fill buffer is checked to see if an outstanding fill request already exists for that line.
6.2.3.3 Write Miss Policy
A write operation that misses the cache will request a 32-byte cache line from external memory if the access is cacheable and write allocation is specified in the page. In this case the following sequence of events occurs:
1.
6.2.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the data cache is round-robin. Each set in the data cache has a round-robin pointer that keeps track of the next line (in that set) to replace. The next line to replace in a set is the next sequential line after the last one that was just filled.
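The per-set pointer behavior can be illustrated with a short model. It is a simplified sketch: it assumes 32 ways per set, uses the reset state given in Section 6.3.1 (pointer at way 31), and ignores lines that have been locked or re-configured as data RAM. The class name is hypothetical.

```python
class RoundRobinSet:
    """Round-robin replacement pointer for a single cache set (simplified)."""

    def __init__(self, ways: int = 32) -> None:
        self.ways = ways
        self.pointer = ways - 1  # after reset the pointer is at the last way

    def fill(self) -> int:
        """Return the way that receives the new line, then advance the pointer."""
        way = self.pointer
        self.pointer = (way + 1) % self.ways
        return way
```

Each set owns its own instance, so a fill in one set never moves the pointer of another set.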
Data Cache and Mini-Data Cache Control
6.3.1 Data Memory State After Reset
After processor reset, both the data cache and mini-data cache are disabled, all valid bits are set to zero (invalid), and the round-robin bit points to way 31. Any lines in the data cache that were configured as data RAM before reset are changed back to cacheable lines after reset, i.e., there are...
6.3.3.1 Global Clean and Invalidate Operation
A simple software routine is used to globally clean the data cache. It takes advantage of the line-allocate data cache operation, which allocates a line into the data cache. This allocation evicts any dirty cache data back to external memory.
Re-configuring the Data Cache as Data RAM
Software has the ability to lock tags associated with 32-byte lines in the data cache, thus creating the appearance of data RAM. Any subsequent access to this line will always hit the cache unless it is invalidated.
Example 6-3. Locking Data into the Data Cache
; R1 contains the virtual address of a region of memory to lock,
; configured with C=1 and B=1
; R0 is the number of 32-byte lines to lock into the data cache. In this
;...
Example 6-4. Creating Data RAM
; R1 contains the virtual address of a region of memory to configure as data RAM,
; which is aligned on a 32-byte boundary.
; MMU is configured so that the memory region is cacheable.
Tags can be locked into the data cache by enabling the data cache lock mode bit located in coprocessor 15, register 9. (See Table 7-14, “Cache Lockdown Functions” on page 7-90 for the exact command.) Once enabled, any new lines allocated into the data cache will be locked down.
Note that an ASSP may also include operations external to the core in the drain operation. (Refer to the Intel XScale® core implementation option section in the ASSP architecture specification for more details.) See Table 7-12, “Cache Functions”...
Any access to CP14 in user mode will cause an undefined instruction exception. Coprocessors CP15 and CP14 on the Intel XScale® core do not support access via CDP, MRRC, or MCRR instructions. An attempt to access these coprocessors with these instructions will result in an undefined instruction exception.
Bits 11:8: cp_num - coprocessor number
0b1111 = CP15
0b1110 = CP14
0b0000 = CP0
NOTE: Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to see if there are any other coprocessors defined by the ASSP.
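The placement of cp_num within a coprocessor register transfer instruction can be shown by assembling the 32-bit instruction word. Only the cp_num field (bits 11:8) is given in the table fragment above; the remaining field positions follow the ARM MRC/MCR encoding and are stated here as an assumption about the surrounding table. The function name is hypothetical.

```python
def encode_coproc_transfer(cp_num, opcode_1, rd, crn, crm, opcode_2,
                           load, cond=0xE):
    """Build an MRC (load=True) or MCR (load=False) instruction word.

    Assumed layout: cond[31:28] 1110[27:24] opcode_1[23:21] L[20]
    CRn[19:16] Rd[15:12] cp_num[11:8] opcode_2[7:5] 1[4] CRm[3:0].
    """
    fields = {
        "cond": (cond, 4), "cp_num": (cp_num, 4), "opcode_1": (opcode_1, 3),
        "Rd": (rd, 4), "CRn": (crn, 4), "CRm": (crm, 4),
        "opcode_2": (opcode_2, 3),
    }
    for name, (value, width) in fields.items():
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return ((cond << 28) | (0b1110 << 24) | (opcode_1 << 21)
            | (int(bool(load)) << 20) | (crn << 16) | (rd << 12)
            | (cp_num << 8) | (opcode_2 << 5) | (1 << 4) | crm)
```

As a usage example, `encode_coproc_transfer(15, 0, 0, 1, 0, 0, load=True)` corresponds to `MRC P15,0,R0,C1,C0,0`, a read of the Control Register, and bits 11:8 of the result hold 0b1111.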
Bits 11:8: cp_num - coprocessor number
The Intel XScale® core defines the following:
0b1111 = Undefined Exception
0b1110 = CP14
NOTE: Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to find out the meaning of the other encodings.
The ID Register is selected when opcode_2=0. This register returns the code for the ASSP, where a portion of it is defined by the ASSP. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for the exact encoding.
7.2.2 Register 1: Control & Auxiliary Control Registers
Register 1 is made up of two registers: one that is compliant with ARM Version 5TE and is selected by opcode_2 = 0x0, and another that is specific to the core and is selected by opcode_2 = 0x1. The latter is known as the Auxiliary Control Register.
Reserved: Read-Unpredictable / Write-as-Zero
Page Table Memory Attribute (P): Read / Write. This field is defined by the ASSP. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for more information.
Write Buffer Coalescing Disable (K)...
7.2.6 Register 5: Fault Status Register
The Fault Status Register (FSR) indicates which fault has occurred, which could be either a prefetch abort or a data abort. Bit 10 extends the encoding of the status field for prefetch aborts and data aborts.
7.2.8 Register 7: Cache Functions
This register should be accessed as write-only. Reads from this register, as with an MRC, have an undefined effect. The Drain Write Buffer function not only drains the write buffer but also drains the fill buffer. The core does not check permissions on addresses supplied for cache or TLB functions.
Other items to note about the line-allocate command are:
• It forces all pending memory operations to complete.
• Bits [31:5] of Rd are used to specify the virtual address of the line to be allocated into the data cache.
7.2.9 Register 8: TLB Operations
Disabling/enabling the MMU has no effect on the contents of either TLB: valid entries stay valid, locked items remain locked. All operations defined in Table 7-13 work regardless of whether the TLB is enabled or disabled.
7.2.10 Register 9: Cache Lock Down
Register 9 is used for locking down entries into the instruction cache and data cache. (The protocol for locking down entries can be found in Chapter 6, “Data Cache”.)
7.2.11 Register 10: TLB Lock Down
Register 10 is used for locking down entries into the instruction TLB and data TLB. (The protocol for locking down entries can be found in Chapter 3, “Memory Management”.) Lock/unlock...
7.2.13.1 The PID Register Effect on Addresses
All addresses generated and used by User Mode code are eligible for being “PIDified” as described in the previous section. Privileged code, however, must be aware of certain special cases in which address generation does not follow the usual flow.
(DBR0), one configurable data mask/address register (DBR1), and one data breakpoint control register (DBCON).
Refer to Chapter 9, “Software Debug” for more information on these features of the Intel XScale® core.
Table 7-19. Accessing the Debug Registers
Function...
This register controls access to CP0 and other coprocessors (CP1 through CP13) that may exist in an ASSP. (See the Intel XScale® core implementation option section of the ASSP architecture specification for a list of coprocessors that may have been implemented.) A typical use for this register is for an operating system to control resource sharing among applications.
Read-as-Zero/Write-as-Zero compatibility
Bits 13:1 (Read / Write): Coprocessor Access Rights - Each bit in this field corresponds to the access rights for each coprocessor. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification to find out which, if any, coprocessors exist and for the definition of these bits.
CP14 Registers
CP14 contains software debug registers, clock and power management registers and the performance monitor registers. All other registers are reserved in CP14. Reading and writing them yields unpredictable results.
7.3.1 Performance Monitoring Registers
There are two variants of the performance monitoring facility;...
7.3.1.2 XSC2 Performance Monitoring Registers
The performance monitoring unit in XSC2 contains a control register (PMNC), a clock counter (CCNT), interrupt enable register (INTEN), overflow flag register (FLAG), event selection register (EVTSEL) and four event counters (PMN0 through PMN3). The format of these registers can be found in Chapter 8, “Performance...
= 0x0). This function informs the clocking unit (located external to the core) to change core clock frequency. Software can read CCLKCFG to determine the current operating frequency. The exact definition of this register can be found in the Intel XScale® core implementation option section of the ASSP architecture specification.
7.3.3 Software Debug Registers
Software debug is supported by address breakpoint registers (Coprocessor 15, register 14), serial communication over the JTAG interface and a trace buffer. Registers 8, 9 and 14 are used for the serial interface, register 10 is for general control and registers 11 through 13 support a 256 entry trace buffer.
If any of the counters overflows, an interrupt request will occur if it is enabled. (What happens to the interrupt request is definable by the ASSP, which typically contains an interrupt controller that handles priority, masking, steering to FIQ or IRQ, etc. Refer to the Intel...
8.2.2 Performance Count Registers (PMN0 - PMN1; CP14 - Register 2 and 3, Respectively)
There are two 32-bit event counters; their format is shown in Table 8-7. The event counters are reset to ‘0’ by the PMNC register or can be set to a predetermined value by directly writing to them.
8.2.4.1 Managing PMNC
The following are a few notes about controlling the performance monitoring mechanism:
• An interrupt will be reported when a counter’s overflow flag is set and its associated interrupt enable bit is set in the PMNC register. The interrupt will remain asserted until software clears the overflow flag by writing a one to the flag that is set.
8.3.2 Performance Count Registers (PMN0 - PMN3)
There are four 32-bit event counters; their format is shown in Table 8-7. The event counters are reset to ‘0’ by setting bit 1 in the PMNC register or can be set to a predetermined value by directly writing to them.
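Writing a predetermined value is useful when an overflow should occur after a known number of events. The arithmetic can be sketched as below, assuming that overflow is signalled when a 32-bit counter wraps from 0xFFFFFFFF to 0; the function names are hypothetical.

```python
COUNTER_BITS = 32
MODULUS = 1 << COUNTER_BITS

def preload_for_overflow_after(events: int) -> int:
    """Value to write so the counter wraps after `events` further events."""
    if not 1 <= events <= MODULUS:
        raise ValueError("events must be between 1 and 2**32")
    return (MODULUS - events) % MODULUS

def elapsed_events(start: int, end: int) -> int:
    """Events counted between two reads, allowing for one wraparound."""
    return (end - start) % MODULUS
```

For instance, preloading 0xFFFFFC18 makes the counter wrap after 1000 events, which gives a fixed sampling interval when the overflow interrupt is enabled.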
8.3.3 Performance Monitor Control Register (PMNC)
The performance monitor control register (PMNC) is a coprocessor register that:
• contains the PMU ID
• extends CCNT counting by six more bits (cycles between counter rollover = 2^38)
•...
8.3.5 Overflow Flag Status Register (FLAG)
FLAG identifies which counter has overflowed and also indicates an interrupt has been requested if the overflowing counter’s corresponding interrupt enable bit (contained within INTEN) is asserted.
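The relationship between FLAG and INTEN can be modeled as two bit masks. The sketch assumes that each counter uses the same bit position in both registers and that flags are cleared by writing a one to the set bit, as described for the overflow flags elsewhere in this chapter; the bit numbering and the class name are hypothetical.

```python
class OverflowFlags:
    """Simplified model of the FLAG and INTEN registers."""

    def __init__(self) -> None:
        self.flag = 0   # one overflow bit per counter
        self.inten = 0  # one interrupt enable bit per counter

    def record_overflow(self, counter_bit: int) -> None:
        """Hardware side: a counter overflowed."""
        self.flag |= 1 << counter_bit

    def write_flag(self, value: int) -> None:
        """Software side: write-one-to-clear; zero bits leave flags unchanged."""
        self.flag &= ~value & 0xFFFFFFFF

    def interrupt_requested(self) -> bool:
        """An interrupt is requested while any enabled flag is set."""
        return bool(self.flag & self.inten)
```

An interrupt handler built on this model would read the flags, service each set bit, and write the same value back to clear exactly the flags it has handled.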
8.3.6 Event Select Register (EVTSEL)
EVTSEL is used to select events for PMN0, PMN1, PMN2 and PMN3. Refer to Table 8-12, “Performance Monitoring Events” on page 8-113 for a list of possible events.
8.3.7 Managing the Performance Monitor
The following are a few notes about controlling the performance monitoring mechanism:
• An interrupt request will be generated when a counter’s overflow flag is set and its associated interrupt enable bit is set in INTEN.
PC changes to the event address, e.g., IRQ, FIQ, SWI, etc.
0x10 through 0x17: Defined by ASSP. See the Intel XScale® core implementation option section of the ASSP architecture specification for more details.
Some typical combinations of counted events are listed in this section and summarized in Table 8-13. In this section, we call such an event combination a mode.
Table 8-13. Some Common Uses of the PMU...
8.4.1 Instruction Cache Efficiency Mode
PMN0 totals the number of instructions that were executed, which does not include instructions fetched from the instruction cache that were never executed. This can happen if a branch instruction changes the program flow;...
This is calculated by dividing PMN0 by PMN1. This statistic lets you know if the duration event cycles are due to many requests or are attributed to just a few requests. If the average is high then the Intel XScale® core may be starved of the external bus.
•...
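The average described above is a simple ratio of the two counters. A minimal sketch follows, assuming PMN0 holds the total number of cycles for which the condition lasted and PMN1 holds the number of times it occurred; the function name and the handling of a zero count are illustrative choices.

```python
def average_event_duration(pmn0_cycles: int, pmn1_occurrences: int):
    """Average cycles per occurrence, or None when nothing was counted."""
    if pmn1_occurrences == 0:
        return None  # no occurrences, so the average is undefined
    return pmn0_cycles / pmn1_occurrences
```

For example, 1200 cycles spread over 40 occurrences gives an average of 30 cycles each, whereas the same 1200 cycles over 4 occurrences gives 300, which points to a few long stalls rather than many short ones.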
8.4.6 Instruction TLB Efficiency Mode
PMN0 totals the number of instructions that were executed, which does not include instructions that were translated by the instruction TLB and never executed. This can happen if a branch instruction changes the program flow;...
Multiple Performance Monitoring Run Statistics
There may be times when the number of events to be monitored exceeds the number of counters. In this case, multiple performance monitoring runs can be done, capturing different events from each run.
Examples
The same example is shown below for both variants (XSC1 and XSC2).
8.6.1 XSC1 Example (2 counter variant)
In this example, the events selected with the Instruction Cache Efficiency mode are monitored and CCNT is used to measure total execution time.
8.6.2 XSC2 Example (4 counter variant)
In this example, the events selected with the Instruction Cache Efficiency mode are monitored and CCNT is used to measure total execution time. Sampling time ends when PMN0 overflows, which will generate an IRQ interrupt.
Software Debug
This chapter describes the software debug and related features implemented in Elkhart, namely:
• debug modes, registers and exceptions.
• a serial debug communication link via the JTAG interface.
• a trace buffer.
Introduction
The Elkhart debug unit, when used with a debugger application, allows software running on an Elkhart target to be debugged. The debug unit allows the debugger to stop program execution and re-direct execution to a debug handling routine. Once program execution has stopped, the debugger can examine or modify processor state, co-processor state, or memory.
Intel XScale® Core Developer’s Manual Software Debug Debug Control and Status Register (DCSR) The DCSR register is the main control register for the debug unit. Table 9-1 shows the format of the register. The DCSR register can be accessed in privileged modes by software running on the core or by a debugger through the JTAG interface.
SOC Break (B) Reading the SOC Break bit returns the value of the SOC break input into the Intel XScale® core. Use of the SOC break input to the core (used to generate SOC debug breaks) is product specific and is targeted towards chips that need system-on-a-chip debug capabilities.
Intel XScale® Core Developer’s Manual Software Debug 9.4.4 Vector Trap Bits (TF,TI,TD,TA,TS,TU,TR) The Vector Trap bits allow instruction breakpoints to be set on exception vectors without using up any of the breakpoint registers. When a bit is set, it acts as if an instruction breakpoint was set up on the corresponding exception vector.
Intel XScale® Core Developer’s Manual Software Debug Debug Exceptions A debug exception causes the processor to re-direct execution to a debug event handling routine. The Elkhart debug architecture defines the following debug exceptions: • instruction breakpoint • data breakpoint •...
Intel XScale® Core Developer’s Manual Software Debug 9.5.1 Halt Mode The debugger turns on Halt Mode through the JTAG interface by scanning in a value that sets the bit in DCSR. The debugger turns off Halt Mode through JTAG, either by scanning in a new DCSR value or by a TRST.
Intel XScale® Core Developer’s Manual Software Debug Following a debug exception, the processor switches to debug mode and enters SDS, which allows the following special functionality: • All events are disabled. SWI or undefined instructions have unpredictable results. The processor ignores pre-fetch aborts, FIQ and IRQ (SDS disables FIQ and IRQ regardless of the enable values in the CPSR).
Intel XScale® Core Developer’s Manual Software Debug 9.5.2 Monitor Mode In Monitor Mode, the processor handles debug exceptions like normal ARM exceptions, except for SOC debug breaks, which are handled like Halt Mode exceptions. If debug functionality is enabled and the processor is in Monitor Mode, debug exceptions cause either a data abort or a pre-fetch abort.
Intel XScale® Core Developer’s Manual Software Debug HW Breakpoint Resources The Elkhart debug architecture defines two instruction and two data breakpoint registers, denoted IBCR0, IBCR1, DBR0, and DBR1. The instruction and data address breakpoint registers are 32-bit registers. The instruction breakpoint causes a break before execution of the target instruction.
Intel XScale® Core Developer’s Manual Software Debug 9.6.2 Data Breakpoints The Elkhart debug architecture defines two data breakpoint registers (DBR0, DBR1). The format of the registers is shown in Table 9-6. Table 9-6. Data Breakpoint Register (DBRx) 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9...
Intel XScale® Core Developer’s Manual Software Debug When DBR1 is programmed as a data address mask, it is used in conjunction with the address in DBR0. The bits set in DBR1 are ignored by the processor when comparing the address of a memory access with the address in DBR0.
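The masked comparison described above can be sketched as follows. This is only a software model of the compare, assuming DBR1 has been programmed as a data address mask; the function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits set in dbr1_mask are ignored when comparing the access address
 * with the address held in DBR0. Hypothetical model, not a core API. */
bool data_breakpoint_matches(uint32_t access_addr, uint32_t dbr0,
                             uint32_t dbr1_mask)
{
    return ((access_addr ^ dbr0) & ~dbr1_mask) == 0;
}
```

With a mask of 0xF, for example, any access whose upper 28 address bits equal those in DBR0 is treated as a match.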
Intel XScale® Core Developer’s Manual Software Debug Software Breakpoints Mnemonics: BKPT (See ARM Architecture Reference Manual, ARMv5T) Operation: If DCSR[31] = 0, BKPT is a nop; If DCSR[31] =1, BKPT causes a debug exception The processor handles the software breakpoint as described in Section 9.5, “Debug Exceptions”...
Transmit/Receive Control Register (TXRXCTRL) Communications between the debug handler and debugger are controlled through handshaking bits that ensure the debugger and debug handler make synchronized accesses to TX and RX. The debugger side of the handshaking is accessed through the DBGTX (Section 9.11.2, “DBGTX JTAG...
Intel XScale® Core Developer’s Manual Software Debug 9.8.1 RX Register Ready Bit (RR) The debugger and debug handler use the RR bit to synchronize accesses to RX. Normally, the debugger and debug handler use a handshaking scheme that requires both sides to poll the RR bit.
Intel XScale® Core Developer’s Manual Software Debug 9.8.2 Overflow Flag (OV) The Overflow flag is a sticky flag that is set when the debugger writes to the RX register while the RR bit is set. The flag is used during high-speed download to indicate that some data was lost. The assumption during high-speed download is that the time it takes for the debugger to shift in the next data word is greater than the time necessary for the debug handler to process the previous data word.
Intel XScale® Core Developer’s Manual Software Debug 9.8.4 TX Register Ready Bit (TR) The debugger and debug handler use the TR bit to synchronize accesses to the TX register. The debugger and debug handler must poll the TR bit before accessing the TX register.
Intel XScale® Core Developer’s Manual Software Debug 9.11 Debug JTAG Access There are four JTAG instructions used by the debugger during software debug: LDIC, SELDCSR, DBGTX and DBGRX. LDIC is described in Section 9.14, “Downloading Code in the Instruction Cache”. The other three JTAG instructions are described in this section. SELDCSR, DBGTX and DBGRX each use a 36-bit shift register to scan in new data and scan out captured data.
Intel XScale® Core Developer’s Manual Software Debug 9.11.1.1 hold_reset The debugger uses hold_reset when loading code into the instruction cache during a processor reset. Details about loading code into the instruction cache are in Section 9.14, “Downloading Code in the Instruction Cache”.
Intel XScale® Core Developer’s Manual Software Debug 9.11.2 DBGTX JTAG Register The ‘DBGTX’ JTAG instruction selects the DBGTX JTAG data register. The JTAG opcode for this instruction is ‘0b0010000’. The debug handler uses the DBGTX data register to send data to the debugger.
9.11.3 DBGRX JTAG Register The ‘DBGRX’ JTAG instruction selects the DBGRX JTAG data register. The JTAG opcode for this instruction is ‘0b0000010’. The debug handler uses the DBGRX data register to receive information from the debugger. A protocol can be set up between the debugger and debug handler to allow the handler to identify data values and commands.
Intel XScale® Core Developer’s Manual Software Debug 9.11.3.1 RX Write Logic The RX write logic (Figure 9-3) serves the following functions: 1) RX Write Enable: The RX register only gets updated when rx_valid is set and is unaffected if rx_valid is clear or an overflow occurs. In particular, when the debugger is polling DBG_SR[0], as long as rx_valid is 0, Update_DR does not modify RX.
Intel XScale® Core Developer’s Manual Software Debug 9.11.3.6 rx_valid The debugger sets the rx_valid bit to indicate the data scanned into DBG_SR[34:3] is valid data to be written to RX. When this bit is set, the data scanned into the DBG_SR will be written to RX following an Update_DR.
Intel XScale® Core Developer’s Manual Software Debug 9.12 Trace Buffer The 256 entry trace buffer provides the ability to capture control flow information to be used for debugging an application. Two modes are supported: 1. The buffer fills up completely and generates a debug exception. Then SW empties the buffer.
Intel XScale® Core Developer’s Manual Software Debug 9.12.1.1 Checkpoint Registers When the debugger reconstructs a trace history, it is required to start at the oldest trace buffer entry and construct a trace going forward. In fill-once mode and wrap-around mode when the buffer does not wrap around, the trace can be reconstructed by starting from the point in the code where the trace buffer was first enabled.
9.12.1.2 Trace Buffer Register (TBREG) The trace buffer is read through TBREG, using MRC and MCR. Software should only read the trace buffer when it is disabled. Reading the trace buffer while it is enabled may cause unpredictable behavior of the trace buffer.
Intel XScale® Core Developer’s Manual Software Debug 9.13 Trace Buffer Entries Trace buffer entries consist of either one or five bytes. Most entries are one byte messages indicating the type of control flow change. The target address of the control flow change represented by the message byte is either encoded in the message byte (like for exceptions) or can be determined by looking at the instruction word (like for direct branches).
Intel XScale® Core Developer’s Manual Software Debug 9.13.1.1 Exception Message Byte When any kind of exception occurs, an exception message is placed in the trace buffer. In an exception message byte, the message type bit (M) is always 0. The vector exception (VVV) field is used to specify bits[4:2] of the vector address (offset from the base of default or relocated vector table).
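Because the VVV field carries bits [4:2] of the vector address, the mapping between the field and the vector offset is a two-bit shift. The sketch below works on the extracted VVV value only; it makes no assumption about where the field sits inside the message byte, and both function names are hypothetical.

```c
#include <stdint.h>

/* VVV holds bits [4:2] of the vector address, so the byte offset of the
 * vector from the table base is VVV shifted left by two. */
uint32_t vector_offset_from_vvv(uint8_t vvv)
{
    return (uint32_t)(vvv & 0x7u) << 2;
}

/* Inverse mapping: the VVV value for a given vector offset. */
uint8_t vvv_from_vector_offset(uint32_t offset)
{
    return (uint8_t)((offset >> 2) & 0x7u);
}
```

For example, a VVV value of 0b110 corresponds to offset 0x18, which is the IRQ vector in the standard ARM vector table.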
Intel XScale® Core Developer’s Manual Software Debug 9.13.1.2 Non-exception Message Byte Non-exception message bytes are used for direct branches, indirect branches, and rollovers. In a non-exception message byte, the 4-bit message type field (MMMM) specifies the type of message (refer to Table 9-18).
Intel XScale® Core Developer’s Manual Software Debug 9.13.1.3 Address Bytes Only indirect branch entries contain address bytes in addition to the message byte. Indirect branch entries always have four address bytes indicating the target of that indirect branch. When reading the trace buffer the MSB of the target address is read out first;...
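Since the most significant address byte is read out first, a debugger can rebuild the 32-bit target by shifting each byte in as it arrives. A minimal sketch, assuming the four address bytes have been collected into an array in read order (the function name is hypothetical):

```c
#include <stdint.h>

/* addr_bytes[0] is the first byte read out (the MSB of the target),
 * addr_bytes[3] is the last address byte read out (the LSB). */
uint32_t indirect_branch_target(const uint8_t addr_bytes[4])
{
    uint32_t target = 0;
    for (int i = 0; i < 4; i++) {
        target = (target << 8) | addr_bytes[i];
    }
    return target;
}
```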
Intel XScale® Core Developer’s Manual Software Debug 9.13.2 Trace Buffer Usage The Elkhart trace buffer is 256 bytes in length. The first byte read from the buffer represents the oldest trace history information in the buffer. The last (256th) byte read represents the most recent entry in the buffer.
Intel XScale® Core Developer’s Manual Software Debug As the trace buffer is read, the oldest entries are read first. Reading a series of 5 (or more) consecutive “0b0000 0000” entries in the oldest entries indicates that the trace buffer has not wrapped around and the first valid entry will be the first non-zero entry read out.
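The wrap-around check described above can be sketched as a scan over the entries in read order. This is an illustrative helper under the stated rule only (five or more leading zero entries mean the buffer has not wrapped); the names are hypothetical, and how a wrapped buffer is decoded is outside the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZERO_RUN_THRESHOLD 5  /* "5 (or more)" consecutive zero entries */

/* entries[] holds trace bytes in the order read out, oldest first.
 * Returns the index of the first valid entry and reports through
 * *wrapped whether the buffer appears to have wrapped around. */
size_t first_valid_trace_entry(const uint8_t *entries, size_t len,
                               bool *wrapped)
{
    size_t zeros = 0;
    while (zeros < len && entries[zeros] == 0x00) {
        zeros++;
    }
    if (zeros >= ZERO_RUN_THRESHOLD) {
        *wrapped = false;
        return zeros;   /* index of first non-zero entry (len if none) */
    }
    *wrapped = true;    /* too few leading zeros: treat as wrapped */
    return 0;
}
```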
9.14 Downloading Code in the Instruction Cache On Elkhart, a mini instruction cache, physically separate from the main instruction cache, can be used as an on-chip instruction RAM. A debugger can download code directly into either instruction cache through JTAG.
Intel XScale® Core Developer’s Manual Software Debug 9.14.2 LDIC JTAG Command The LDIC JTAG instruction selects the JTAG data register for loading code into the instruction cache. The JTAG opcode for this instruction is ‘00111’. The LDIC instruction must be in the JTAG instruction register in order to load code directly into the instruction cache through JTAG.
It does not require a virtual address or any data arguments. Load Main IC and Load Mini IC write one line of data (8 ARM instructions) into the specified instruction cache at the specified virtual address. Load Main IC has been deprecated on the Intel ®...
Intel XScale® Core Developer’s Manual Software Debug Figure 9-8. Format of LDIC Cache Functions VA[31:5] Invalidate IC Line . . . Invalidate Mini IC - indicates first bit shifted in Data Word 7 - indicates last bit shifted in Load Main IC...
Intel XScale® Core Developer’s Manual Software Debug 9.14.5 Loading Instruction Cache During Reset Code can be downloaded into the instruction cache through JTAG during a processor reset. This feature is used during software debug to download the debug handler prior to starting a debug session.
Intel XScale® Core Developer’s Manual Software Debug Table 9-20 describes the actions a debugger should take to load code into the mini instruction cache during reset: Table 9-20. Steps For Loading Mini Instruction Cache During Reset Step # Action Notes...
9.14.6 Dynamically Loading Instruction Cache After Reset A debugger can load code into the instruction cache “on the fly” or “dynamically”. This occurs when the debugger downloads code while the core is not held in reset and is useful for expanding the functionality of the debug handler.
Intel XScale® Core Developer’s Manual Software Debug Table 9-21. Steps For Dynamically Loading the Mini Instruction Cache Action Step # Notes Debugger Debug Handler Debugger must poll DBGTX for an indication from the debug handler that it is safe to begin the download.
The Intel Debug Handler is a complete debug handler that implements the more commonly used functions, and allows less frequently used functions to be dynamically downloaded.
Performance Considerations This chapter describes relevant performance considerations that compiler writers, application programmers and system designers need to be aware of to use the Intel XScale® core efficiently. Performance numbers discussed here include interrupt latency, branch prediction, and instruction latencies.
10.2 Branch Prediction The Intel XScale® core implements dynamic branch prediction for the ARM* instructions B and BL and for the Thumb instruction B. Any instruction that specifies the PC as the destination is predicted as not taken.
Intel XScale® Core Developer’s Manual Performance Considerations 10.4 Instruction Latencies The latencies for all the instructions are shown in the following sections with respect to their functional groups: branch, data processing, multiply, status register access, load/store, semaphore, and coprocessor. The following section explains how to read these tables.
Intel XScale® Core Developer’s Manual Performance Considerations • Minimum Resource Latency The minimum cycle distance from the issue clock of the current multiply instruction to the issue clock of the next multiply instruction assuming the second multiply does not incur a data dependency and is immediately available from the instruction cache or memory interface.
Intel XScale® Core Developer’s Manual Performance Considerations 10.4.7 Load/Store Instructions Table 10-11. Load and Store Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency 3 for load data; 1 for writeback of base LDRB 3 for load data; 1 for writeback of base LDRBT 3 for load data;...
10.4.11 Thumb Instructions In general, the timing of Thumb instructions is the same as that of their equivalent ARM instructions, except for the cases listed below. • If the equivalent ARM instruction maps to one in Table 10-3, the “Minimum Issue Latency...
It can also be used by application developers to obtain the best performance from their assembly language code. The optimizations presented in this chapter are based on the Intel XScale® core, and hence can be applied to all products that are based on it.
Optimization Guide The Intel XScale® Core Pipeline One of the biggest differences between the Intel XScale® core and StrongARM processors is the pipeline. Many of the differences are summarized in Figure A-1. This section provides a brief description of the structure and behavior of the core pipeline.
A.2.1.2. The Intel XScale® Core Pipeline Organization The Intel XScale® core single-issue superpipeline consists of a main execution pipeline, MAC pipeline, and a memory access pipeline. These are shown in Figure A-1, with the main execution pipeline shaded.
and store instructions. The Intel XScale® core preserves a weak processor consistency because instructions may complete out of order, provided that no data dependencies exist.
A.2.2 Instruction Flow Through the Pipeline The Intel XScale® core pipeline issues a single instruction per clock cycle. Instruction execution begins at the F1 pipestage and completes at the WB pipestage. Although a single instruction may be issued per clock cycle, all three pipelines (MAC, memory, and main execution) may be processing instructions simultaneously.
Intel XScale® Core Developer’s Manual Optimization Guide A.2.3 Main Execution Pipeline A.2.3.1. F1 / F2 (Instruction Fetch) Pipestages The job of the instruction fetch stages F1 and F2 is to present the next instruction to be executed to the ID stage. Several important functional units reside within the F1 and F2 stages, including: •...
Intel XScale® Core Developer’s Manual Optimization Guide A.2.3.3. RF (Register File / Shifter) Pipestage The main function of the RF pipestage is to read and write to the register file unit, or RFU. It provides source data to: • EX for ALU operations •...
Intel XScale® Core Developer’s Manual Optimization Guide A.2.4 Memory Pipeline The memory pipeline consists of two stages, D1 and D2. The data cache unit, or DCU, consists of the data-cache array, mini-data cache, fill buffers, and writebuffers. The memory pipeline handles load / store instructions.
Basic Optimizations This chapter outlines optimizations specific to the ARM architecture. These optimizations have been modified to suit the core where needed. A.3.1 Conditional Instructions The Intel XScale® core architecture provides the ability to execute instructions conditionally. This feature, combined with the ability of the core instructions to modify the condition codes, makes possible a wide array of optimizations.
#0 r0, #1 The code generated above takes three cycles to execute the else-part and four cycles for the if-part, assuming best case conditions and no branch misprediction penalties. In the case of the Intel XScale® core, a branch misprediction incurs a penalty of four cycles. If the branch is mispredicted 50% of the time, and if we consider that both the if-part and the else-part are equally likely to be taken, on average the code above takes 5.5 cycles to execute.
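The 5.5-cycle figure follows from weighting the two paths equally and adding the expected misprediction cost: (0.5 × 4 + 0.5 × 3) + 0.5 × 4 = 3.5 + 2 = 5.5. A small sketch of that arithmetic, with a hypothetical function name and the numbers from the text passed in as parameters:

```c
/* Expected cycles for a branch-based if/else:
 * path cost weighted by how often each path is taken, plus the
 * misprediction penalty weighted by the misprediction rate. */
double average_branch_cycles(double if_cycles, double else_cycles,
                             double p_if, double mispredict_rate,
                             double mispredict_penalty)
{
    double path_cost = p_if * if_cycles + (1.0 - p_if) * else_cycles;
    return path_cost + mispredict_rate * mispredict_penalty;
}
```

Calling `average_branch_cycles(4, 3, 0.5, 0.5, 4)` reproduces the 5.5 cycles quoted above.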
Intel XScale® Core Developer’s Manual Optimization Guide Consider that we have the following data: Number of cycles to execute the if_stmt assuming the use of branch instructions Number of cycles to execute the else_stmt assuming the use of branch instructions...
Intel XScale® Core Developer’s Manual Optimization Guide A.3.1.3. Optimizing Complex Expressions Conditional instructions should also be used to improve the code generated for complex expressions such as the C shortcut evaluation feature. Consider the following C code segment: int foo(int a, int b) if (a != 0 &&...
A.3.2 Bit Field Manipulation The Intel XScale® core shift and logical operations provide a useful way of manipulating bit fields. Bit field operations can be optimized as follows: ;Set the bit number specified by r1 in register r0...
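For readers working in C rather than assembly, the shift-and-logical pattern looks like the sketch below. Only the "set bit" case is visible in this excerpt; the "clear bit" helper is included as its natural counterpart and is an assumption, as are the function names.

```c
#include <stdint.h>

/* Set the bit whose number is given by 'bit' (0..31) in 'value'. */
uint32_t set_bit(uint32_t value, unsigned bit)
{
    return value | (UINT32_C(1) << (bit & 31u));
}

/* Clear the bit whose number is given by 'bit' (0..31) in 'value'. */
uint32_t clear_bit(uint32_t value, unsigned bit)
{
    return value & ~(UINT32_C(1) << (bit & 31u));
}
```

Each helper is one shift plus one logical operation, which is the shape that maps onto a shifted-operand instruction.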
A.3.3 Optimizing the Use of Immediate Values The Intel XScale® core MOV or MVN instruction should be used when loading an immediate (constant) value into a register. Please refer to the ARM Architecture Reference Manual for the set of immediate values that can be used in a MOV or MVN instruction.
A.3.4 Optimizing Integer Multiply and Divide Multiplication by an integer constant should be optimized to make use of the shift operation whenever possible.
;Multiplication of R0 by 2^n
mov r0, r0, LSL #n
;Multiplication of R0 by 2^n+1
add r0, r0, r0, LSL #n
·...
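The same shift-based idea can be expressed in C for constants of the form 2^n and 2^n + 1. This is a minimal sketch with hypothetical function names; n is assumed to be less than 32, and the arithmetic wraps modulo 2^32 like the register operations it models.

```c
#include <stdint.h>

/* x * 2^n as a single left shift (n must be < 32). */
uint32_t mul_pow2(uint32_t x, unsigned n)
{
    return x << n;
}

/* x * (2^n + 1) as one shift and one add (n must be < 32). */
uint32_t mul_pow2_plus1(uint32_t x, unsigned n)
{
    return x + (x << n);
}
```

For instance, multiplying by 17 becomes `mul_pow2_plus1(x, 4)`, a single add with a shifted operand instead of a multiply.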
A.3.5 Effective Use of Addressing Modes The Intel XScale® core provides a variety of addressing modes that make indexing an array of objects highly efficient. For a detailed description of these addressing modes, please refer to the ARM Architecture Reference Manual.
Intel XScale® Core Developer’s Manual Optimization Guide Cache and Prefetch Optimizations This section considers how to use the various cache memories in all their modes and then examines when and how to use prefetch to improve execution efficiencies. A.4.1 Instruction Cache ®...
Intel XScale® Core Developer’s Manual Optimization Guide A.4.1.4. Locking Code into the Instruction Cache One very important instruction cache feature is the ability to lock code into the instruction cache. Once locked into the instruction cache, the code is always available for fast execution. Another reason for locking critical code into cache is that with the round robin replacement policy, eventually the code will be evicted, even if it is a very frequently executed function.
A.4.2 Data and Mini Cache The Intel XScale® core allows the user to define memory regions whose cache policies can be set by the user (see Section 6.2.3, “Cache Policies”). Supported policies and configurations are: •...
Intel XScale® Core Developer’s Manual Optimization Guide A.4.2.3. Read Allocate and Read-write Allocate Memory Regions Most of the regular data and the stack for your application should be allocated to a read-write allocate region. It is expected that you will be writing and reading from them often.
A.4.2.5. Mini-data Cache The mini-data cache is best used for data structures that have short temporal lives and/or cover vast amounts of data space. Addressing these types of data spaces from the Data cache would corrupt much if not all of the Data cache by evicting valuable data.
Intel XScale® Core Developer’s Manual Optimization Guide A.4.2.6. Data Alignment Cache lines begin on 32-byte address boundaries. To maximize cache line use and minimize cache pollution, data structures should be aligned on 32 byte boundaries and sized to multiple cache line sizes.
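In C11, the alignment and sizing advice can be written down directly. The sketch below is illustrative only: the structure and its field are hypothetical, and it assumes a toolchain that supports `alignas` from `<stdalign.h>`.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32

/* Hypothetical structure: aligned to a cache line boundary and sized to
 * exactly one line (8 x 4 bytes = 32 bytes). */
typedef struct {
    alignas(CACHE_LINE_BYTES) uint32_t samples[8];
} sample_block;

/* Compile-time check that the size is a whole number of cache lines. */
_Static_assert(sizeof(sample_block) % CACHE_LINE_BYTES == 0,
               "sample_block must be a multiple of the cache line size");
```

Arrays of `sample_block` then never straddle a line boundary, so touching one element brings in only data that belongs to it.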
A.4.2.7. Literal Pools The Intel XScale® core does not have a single instruction that can move all literals (a constant or address) to a register. One technique to load registers with literals in the core is by loading the literal from a memory location that has been initialized with the constant or address.
Intel XScale® Core Developer’s Manual Optimization Guide A.4.3 Cache Considerations A.4.3.1. Cache Conflicts, Pollution and Pressure Cache pollution occurs when unused data is loaded in the cache and cache pressure occurs when data that is not temporal to the current process is loaded into the cache. For an example, see Section A.4.4.2., “Prefetch Loop Scheduling”...
Prefetch Distances Scheduling the prefetch instruction requires understanding the system latency times and system resources which affect when to use the prefetch instruction. Refer to the Intel XScale® core implementation option section of the ASSP architecture specification for more information.
Intel XScale® Core Developer’s Manual Optimization Guide A.4.4.5. Low Number of Iterations Loops with very low iteration counts may have the advantages of prefetch completely mitigated. A loop with a small fixed number of iterations may be faster if the loop is completely unrolled rather than trying to schedule prefetch instructions.
Intel XScale® Core Developer’s Manual Optimization Guide A.4.4.7. Cache Memory Considerations Stride, the way data structures are walked through, can affect the temporal quality of the data and reduce or increase cache conflicts. The data cache and mini-data caches each have 32 sets of 32 bytes.
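To see how stride interacts with 32 sets of 32-byte lines, it helps to compute which set an address falls in. The sketch below assumes the conventional arrangement in which the set index is taken from the address bits just above the line offset; that indexing is an assumption for illustration, and the function name is hypothetical.

```c
#include <stdint.h>

#define LINE_BYTES 32u
#define NUM_SETS   32u

/* Assumed indexing: drop the 5 line-offset bits, then take the result
 * modulo the number of sets. */
unsigned cache_set_index(uint32_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}
```

Under this assumption, two addresses 1024 bytes apart land in the same set, so walking a structure with a 1 KB stride concentrates every access in a single set.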
In the data structure shown above, the fields Year2DatePay, Year2DateTax, Year2Date401KDed, and Year2DateOtherDed are likely to change with each pay check. The remaining fields, however, change very rarely. If the fields are laid out as shown above, assuming that the structure is aligned on a 32-byte boundary, modifications to the Year2Date fields are likely to use two write buffers when the data is written out to memory.
Intel XScale® Core Developer’s Manual Optimization Guide A.4.4.8. Cache Blocking Cache blocking techniques, such as strip-mining, are used to improve temporal locality of the data. Given a large data set that can be reused across multiple passes of a loop, data blocking divides the...
A.4.4.10. Pointer Prefetch Not all looping constructs contain induction variables. However, prefetching techniques can still be applied. Consider the following linked list traversal example: while(p) { do_something(p->data); p = p->next; } The pointer variable p becomes a pseudo induction variable and the data pointed to by p->next can be prefetched to reduce data transfer latency for the next iteration of the loop.
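One way to express this in portable C is with the GCC/Clang `__builtin_prefetch` hint, used here as a stand-in for the core's prefetch instruction. The node type, the `PREFETCH` macro, and the summing body (which takes the place of do_something) are assumptions made for the sketch.

```c
#include <stddef.h>

struct node {
    int data;
    struct node *next;
};

/* Use the compiler's prefetch hint where available; otherwise do nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

long sum_list(const struct node *p)
{
    long total = 0;
    while (p) {
        PREFETCH(p->next);   /* start fetching the next node early */
        total += p->data;    /* stands in for do_something(p->data) */
        p = p->next;
    }
    return total;
}
```

The prefetch is issued before the work on the current node so that the fetch of the next node overlaps with that work. The hint does not fault when `p->next` is NULL at the end of the list.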
A.4.4.11. Loop Interchange As mentioned earlier, the sequence in which data is accessed affects cache thrashing. Usually, it is best to access data in a spatially contiguous address range. However, arrays of data may have been laid out such that indexed elements are not physically next to each other.
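A small C sketch of the transformation, using a hypothetical 4 x 8 array: both functions compute the same sum, but the second walks memory contiguously because C stores arrays row by row.

```c
#define ROWS 4
#define COLS 8

/* Before interchange: the inner loop walks down a column, so consecutive
 * accesses are COLS * sizeof(int) bytes apart. */
long sum_before_interchange(int a[ROWS][COLS])
{
    long total = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            total += a[i][j];
    return total;
}

/* After interchange: the inner loop walks along a row, so consecutive
 * accesses are adjacent in memory. */
long sum_after_interchange(int a[ROWS][COLS])
{
    long total = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            total += a[i][j];
    return total;
}
```

The interchange is only valid when the loop body has no dependence that relies on the original iteration order, as is the case for this sum.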
Intel XScale® Core Developer’s Manual Optimization Guide A.4.4.13. Prefetch to Reduce Register Pressure Prefetch can be used to reduce register pressure. When data is needed for an operation, then the load is scheduled far enough in advance to hide the load latency. However, the load ties up the receiving register until the data can be used.
Intel XScale® Core Developer’s Manual Optimization Guide Instruction Scheduling This chapter discusses instruction scheduling optimizations. Instruction scheduling refers to the rearrangement of a sequence of instructions for the purpose of minimizing pipeline stalls. Reducing the number of pipeline stalls improves application performance. While making this rearrangement, care should be taken to ensure that the rearranged sequence of instructions has the same effect as the original sequence of instructions.
The result latency for an LDR instruction is significantly higher if the data being loaded is not in the data cache. To minimize the number of pipeline stalls in such a situation, the LDR instruction should be moved as far away as possible from the instruction that uses the result of the load.
The Intel XScale® core has 4 fill-buffers that are used to fetch data from external memory when a data-cache miss occurs. The core stalls when all fill buffers are in use. This happens when more than 4 loads are outstanding and are being fetched from memory.
A.5.1.1. Scheduling Load and Store Double (LDRD/STRD) The Intel XScale® core introduces two new double word instructions: LDRD and STRD. LDRD loads 64 bits of data from an effective address into two consecutive registers; conversely, STRD stores 64 bits from two consecutive registers to an effective address.
Intel XScale® Core Developer’s Manual Optimization Guide A.5.1.2. Scheduling Load and Store Multiple (LDM/STM) LDM and STM instructions have an issue latency of 2-20 cycles depending on the number of registers being loaded or stored. The issue latency is typically 2 cycles plus an additional cycle for each of the registers being loaded or stored assuming a data cache hit.
Intel XScale® Core Developer’s Manual Optimization Guide A.5.2 Scheduling Data Processing Instructions Most core data processing instructions have a result latency of 1 cycle. This means that the current instruction is able to use the result from the previous data processing instruction. However, the result latency is 2 cycles if the current instruction needs to use the result of the previous data processing instruction for a shift by immediate.
Intel XScale® Core Developer’s Manual Optimization Guide A.5.3 Scheduling Multiply Instructions Multiply instructions can cause pipeline stalls due to either resource conflicts or result latencies. The following code segment would incur a stall of 0-3 cycles depending on the values in registers r1, r2, r4 and r5 due to resource conflicts.
Intel XScale® Core Developer’s Manual Optimization Guide A.5.4 Scheduling SWP and SWPB Instructions The SWP and SWPB instructions have a 5 cycle issue latency. As a result of this latency, the instruction following the SWP/SWPB instruction would stall for 4 cycles. SWP and SWPB instructions should, therefore, be used only where absolutely needed.
Intel XScale® Core Developer’s Manual Optimization Guide A.5.5 Scheduling the MRA and MAR Instructions (MRRC/MCRR) The MRA (MRRC) instruction has an issue latency of 1 cycle, a result latency of 2 or 3 cycles depending on the destination register value being accessed and a resource latency of 2 cycles.
Intel XScale® Core Developer’s Manual Optimization Guide A.5.6 Scheduling the MIA and MIAPH Instructions The MIA instruction has an issue latency of 1 cycle. The result and resource latency can vary from 1 to 3 cycles depending on the values in the source register.
Intel XScale® Core Developer’s Manual Optimization Guide A.5.7 Scheduling MRS and MSR Instructions The MRS instruction has an issue latency of 1 cycle and a result latency of 2 cycles. The MSR instruction has an issue latency of 2 cycles (6 if updating the mode bits) and a result latency of 1 cycle.
Intel XScale® Core Developer’s Manual Optimization Guide Optimizing C Libraries Many of the standard C library routines can benefit greatly by being optimized for the core architecture. The following string and memory manipulation routines should be tuned to obtain the...
Test Features This chapter gives a brief overview of the Intel XScale® core JTAG features. The Intel XScale® core provides a baseline set of features upon which the ASSP builds. A full description of these features can be found in the ASSP architecture specification.