Summary of Contents for Intel PXA255
Intel® XScale™ Microarchitecture for the PXA255 Processor User’s Manual March, 2003 Order Number: 278796...
Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel®...
Contents Introduction...........................1-1 About This Document ......................1-1 1.1.1 How to Read This Document ................1-1 1.1.2 Other Relevant Documents ..................1-1 High-Level Overview of the Intel® XScale™ core as Implemented in the Application Processors ......................1-2 1.2.1 ARM* Compatibility ....................1-3 1.2.2 Features........................1-3 1.2.2.1 Multiply/Accumulate (MAC)..............1-3 1.2.2.2...
Operation When Data Caching is Disabled ............6-4 6.2.3 Cache Policies ......................6-4 6.2.3.1 Cacheability ..................6-4 6.2.3.2 Read Miss Policy ..................6-4 6.2.3.3 Write Miss Policy...................6-5 6.2.3.4 Write-Back Versus Write-Through ............6-6 6.2.4 Round-Robin Replacement Algorithm ..............6-6 6.2.5 Parity Protection ....................6-6 6.2.6 Atomic Accesses ....................6-7 Intel® XScale™ Microarchitecture User’s Manual...
Intel® XScale™ Core Pipeline..................A-1 A.2.1 General Pipeline Characteristics ................. A-2 A.2.1.1. Number of Pipeline Stages ..............A-2 A.2.1.2. Intel® XScale™ Core Pipeline Organization ........A-2 A.2.1.3. Out Of Order Completion ..............A-3 A.2.1.4. Register Dependencies................ A-3 A.2.1.5. Use of Bypassing ................. A-3 A.2.2...
Use of PLD Instructions ..................A-32 A.6.4 Thumb Instructions .................... A-32 Figures 1-1 Intel® XScale™ Microarchitecture Architecture Features ............1-3 3-1 Example of Locked Entries in TLB.....................3-8 4-1 Instruction Cache Organization ....................4-1 4-2 Locked Line Effect on Round Robin Replacement ..............4-6...
10-11Code Download During a Cold Reset For Debug ..............10-35 10-12Code Download During a Warm Reset For Debug..............10-37 10-13Downloading Code in IC During Program Execution ............10-38 Intel® XScale™ Core RISC Superpipeline...........A-2 Tables 2-1 Multiply with Internal Accumulate Format..................2-4 2-2 MIA{<cond>} acc0, Rm, Rs .......................2-4 2-3 MIAPH{<cond>} acc0, Rm, Rs ....................2-5...
Intel retains the right to make changes to these specifications at any time, without notice. In particular, descriptions of features, timings, and pin-outs do not imply a commitment to implement them.
This document limits itself to describing the implementation of the Intel® XScale™ core as it is implemented in the PXA255 processor. In almost every attribute the Intel® XScale™ core used in the application processor is identical to the Intel® XScale™ core implemented in the Intel®...
Introduction The Intel® XScale™ core provides the ARM* V5T Thumb instruction set and the ARM* V5E DSP extensions. To further enhance multimedia applications, the Intel® XScale™ core includes additional Multiply-Accumulate functionality as the first instantiation of Intel® Media Processing Technology. These new operations from Intel are mapped into ARM* coprocessor space.
1.2.2.4 Branch Target Buffer The Intel® XScale™ core provides a Branch Target Buffer (BTB) to predict the outcome of branch type instructions. It provides storage for the target address of branch type instructions and predicts the next address to present to the instruction cache when the current instruction address is that of a branch.
Access Port (TAP) Controller implementation, which is based on IEEE 1149.1 (JTAG) Standard Test Access Port and Boundary-Scan Architecture. The purpose of the TAP controller is to support test logic internal and external to the Intel® XScale™ core such as built-in self-test and boundary- scan.
Software should not modify reserved fields or depend on any values in reserved fields. Translation Look-aside Buffer, a cache of Page Table descriptors loaded from memory to minimize page-table walking overhead.
2.2.1 Big Endian versus Little Endian The Intel® XScale™ core supports both big and little endian data representation. The B-bit of the Control Register (Coprocessor 15, register 1, bit 7) selects between big and little endian mode. The default behavior of the application processor at reset is little endian. To run in big endian mode, the B bit must be set before attempting any sub-word accesses to memory.
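The B-bit manipulation described above can be sketched as a small Python model. This is only an illustration of the bit arithmetic: the helper names are hypothetical, and on real hardware the value would be read and written through coprocessor 15 accesses rather than passed around as an integer.

```python
# Only the bit position (Control Register, CP15 register 1, bit 7) comes
# from the text; the function names are hypothetical.
B_BIT = 1 << 7

def select_big_endian(control_reg: int) -> int:
    """Return the 32-bit Control Register value with the B bit set."""
    return (control_reg | B_BIT) & 0xFFFFFFFF

def is_big_endian(control_reg: int) -> bool:
    """True when the B bit is set; a clear B bit means little endian."""
    return bool(control_reg & B_BIT)
```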
Base Restored Abort Model. Extensions to ARM* Architecture The Intel® XScale™ core made a few extensions to the ARM* Version 5 architecture to meet the needs of various markets and design requirements. The following is a list of the extensions which are discussed in the next sections.
2.3.1 DSP Coprocessor 0 (CP0) The Intel® XScale™ core adds a DSP coprocessor to the architecture for the purpose of increasing the performance and the precision of audio processing algorithms. This coprocessor contains a 40-bit accumulator and 8 new instructions.
Two new fields were created for this format, acc and opcode_3. The acc field specifies 1 of 8 internal accumulators to operate on and opcode_3 defines the operation for this format. The Intel® XScale™ core defines a single 40-bit accumulator referred to as acc0; future implementations may define multiple internal accumulators.
11-5. Specifying R15 for register Rs or Rm has unpredictable results. acc0 is defined to be 0b000 on the Intel® XScale™ core. The MIAPH instruction performs two 16-bit signed multiplies on packed half word data and accumulates these to a single 40-bit accumulator. The first signed multiplication is performed on the lower 16 bits of the value in register Rs with the lower 16 bits of the value in register Rm.
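A minimal Python sketch of the MIAPH arithmetic may help. The excerpt above describes only the first multiply (lower halves); the second multiply on the upper 16 bits of Rs and Rm, and the wrap-around of the result to 40 bits, are assumptions made here to complete the picture. The function names are hypothetical.

```python
MASK40 = (1 << 40) - 1  # the accumulator is 40 bits wide

def _signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed half word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def miaph(acc0: int, rm: int, rs: int) -> int:
    """Model of MIAPH: two signed 16x16 products added into acc0."""
    low_product = _signed16(rs) * _signed16(rm)               # from the text
    high_product = _signed16(rs >> 16) * _signed16(rm >> 16)  # assumed
    return (acc0 + low_product + high_product) & MASK40       # 40-bit wrap assumed
```

For example, with Rm = 0x00020003 and Rs = 0x00040005 and a cleared accumulator, the model accumulates 3*5 + 2*4.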
The acc field specifies 1 of 8 internal accumulators to transfer data to/from. The Intel® XScale™ core implements a single 40-bit accumulator referred to as acc0; future implementations can specify multiple internal accumulators of varying sizes.
Section 7.2.13, “Register 15: Coprocessor Access Register” on page 7-14 for more details). The Intel® XScale™ core implements two instructions MAR and MRA that move two ARM* registers to acc0 and move acc0 to two ARM* registers, respectively. Table 2-5. Internal Accumulator Access Format...
RdLo. Bits[39:32] of the value in acc0 are sign extended to 32 bits and moved into the register RdHi. The instruction is only executed if the condition specified in the instruction matches the condition code status. This instruction executes in any processor mode.
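The register split performed when acc0 is moved to two ARM* registers can be modeled as follows. The sign extension of bits[39:32] into RdHi is taken from the text; placing bits[31:0] in RdLo is an assumption, since the excerpt is cut off before that part. The function name is hypothetical.

```python
def mra(acc0: int) -> tuple[int, int]:
    """Model of moving the 40-bit acc0 into (RdLo, RdHi) as 32-bit values."""
    rd_lo = acc0 & 0xFFFFFFFF          # assumed: low word goes to RdLo
    rd_hi = (acc0 >> 32) & 0xFF        # bits[39:32]
    if rd_hi & 0x80:                   # sign extend 8 bits to 32 bits
        rd_hi |= 0xFFFFFF00
    return rd_lo, rd_hi
```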
Fine Page Table” on page 2-10. Two second-level descriptor formats have been defined for the Intel® XScale™ core, one is used for the coarse page table and the other is used for the fine page table. AP bits are ARM* Access Permission controls.
Tiny Page Base Address C B 1 1 The TEX (Type Extension) field is present in several of the descriptor types. In the Intel® XScale™ core, only the LSB of this field is used; this is called the X bit.
Exception Summary Table 2-11 shows all the exceptions that the Intel® XScale™ core may generate, and the attributes of each. Subsequent sections give details on each exception. A precise exception is defined as one where R14_mode always contains a pointer to locate the instruction that caused the exception.
2.3.4.4 Data Aborts Two types of data aborts exist in the Intel® XScale™ core: precise and imprecise. A precise data abort is defined as one where R14_ABORT always contains the PC (+8) of the instruction that caused the exception. An imprecise abort is one where R14_ABORT contains the PC (+4) of the next instruction to execute and not the address of the instruction that caused the abort.
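As a worked illustration of the two offsets described above, the following Python sketch recovers the relevant addresses from R14_ABORT. The function names are hypothetical, and the 32-bit masking is only there to keep the model within register width.

```python
def precise_abort_instruction(r14_abort: int) -> int:
    """Precise abort: R14_ABORT holds the faulting instruction's PC + 8."""
    return (r14_abort - 8) & 0xFFFFFFFF

def imprecise_abort_next_instruction(r14_abort: int) -> int:
    """Imprecise abort: R14_ABORT holds the next instruction's PC + 4.

    The instruction that actually caused the abort cannot be recovered
    from this value.
    """
    return (r14_abort - 4) & 0xFFFFFFFF
```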
+ 4, which is the same for both ARM* and Thumb mode. Although the Intel® XScale™ core guarantees the Base Restored Abort Model for precise aborts, it cannot do so in the case of imprecise aborts. A Data Abort handler may encounter an updated base register if it is invoked because of an imprecise abort.
When execution reaches the end of the list, the PLD on address 0x0 will not cause a fault. Rather, it will be ignored and the loop will terminate normally.
MOVS R0, R1 ; Advance to next node. At end of list?
BNE sumList ; If not then loop
2.3.4.6 Debug Events
Debug events are covered in Section 10.4, “Debug Exceptions” on page 10-5.
TLB along with the access rights and attributes of the page or section. These translations can also be locked down in either TLB to guarantee the performance of critical routines. The Intel® XScale™ core allows system software to associate various attributes with regions of memory: •...
These attributes are ignored when the MMU is disabled. To allow compatibility with older system software, the new Intel® XScale™ core attributes take advantage of encoding space in the descriptors that was formerly reserved and defaulted to zero.
Thus software may issue a fence to impose a partial ordering on memory accesses. Table 3-3 on page 3-4 shows the circumstances in which memops act as fences.
An individual entry in the data or instruction TLB can also be invalidated. See Table 7-13, “TLB Functions” on page 7-11 for a listing of commands supported by the Intel® XScale™ core.
Locking entries into either the instruction TLB or data TLB reduces the available number of entries (by the number that was locked down) for hardware to cache other virtual to physical address translations. A procedure for locking entries into the instruction TLB is shown in Example 3-2 on page 3-6.
Software should disable interrupts (FIQ or IRQ) in this case. As a general rule, software should avoid locking in anything other than Supervisor mode. The proper procedure for locking entries into the data TLB is shown in Example 3-3 on page 3-7.
Only entries 0 through 30 can be locked in either TLB; entry 31 can never be locked. If the lock pointer is at entry 31, a lock operation will update the TLB entry with the translation and ignore the lock. In this case, the round-robin pointer will stay at entry 31.
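The lock-pointer rule above can be illustrated with a toy Python model. It is not cycle-accurate: the class and method names are hypothetical, and the reset value of the lock pointer and its advance by one entry per lock are assumptions. Only the behavior at entry 31 is taken directly from the text.

```python
class TlbLockModel:
    """Toy model of the lock pointer of one 32-entry TLB."""
    ENTRIES = 32
    LAST = ENTRIES - 1  # entry 31 can never be locked

    def __init__(self):
        self.lock_ptr = 0                      # assumed reset value
        self.locked = [False] * self.ENTRIES

    def translate_and_lock(self):
        """Write a translation at the lock pointer; return (entry, locked)."""
        entry = self.lock_ptr
        if entry == self.LAST:
            return entry, False                # entry updated, lock ignored
        self.locked[entry] = True
        self.lock_ptr += 1                     # assumed: one entry per lock
        return entry, True

    def replaceable_entries(self) -> int:
        """Entries still available to hardware for other translations."""
        return self.locked.count(False)
```

After 31 successful lock operations in this model, only entry 31 remains for ordinary translations, which is why locking should be used sparingly.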
Instruction Cache The Intel® XScale™ core instruction cache enhances performance by reducing the number of instruction fetches from external memory. The cache provides fast execution of cached code. Code can also be locked down when guaranteed or fast access time is required. An additional 2Kbyte mini instruction cache is used exclusively during debugging; see Section 10.13.6...
Each external fetch request uses a fetch buffer that holds 32 bytes and eight valid bits, one for each word. A miss causes the following: 1. A fetch buffer is allocated.
1 parity bit. The instruction cache tag is not parity protected. When a parity error is detected on an instruction cache access, a prefetch abort exception occurs if the Intel® XScale™ core attempts to execute the instruction. Before servicing the exception, hardware places a notification of the error in the Fault Status Register (Coprocessor 15, register 5).
Instruction Fetch Latency The instruction fetch latency is dependent on the core to memory frequency ratio, system bus bandwidth, system memory, etc. The outstanding external memory bus activity on the PXA255 processor will have the highest impact on instruction fetch latency.
; The instruction cache is guaranteed to be invalidated at this point; the next ; instruction sees the result of the invalidate command. The Intel® XScale™ core also supports invalidating an individual line from the instruction cache. See Table 7-12, “Cache Functions” on page 7-9 for the exact command.
[Figure 4-2. Locked Line Effect on Round Robin Replacement. Sets 0 through 31, ways 0 through 31. Set 2: 28 ways locked, only ways 28-31 available for replacement. Set 31: all 32 ways available for round robin replacement.]
4.3.5 Unlocking Instructions in the Instruction Cache The Intel® XScale™ core provides a global unlock command for the instruction cache. There is no unlock function for individual lines in the cache. Writing to coprocessor 15, register 9 unlocks all the locked lines in the instruction cache and leaves them valid.
The Intel® XScale™ core uses dynamic branch prediction to reduce the penalties associated with changing the flow of program execution. The Intel® XScale™ core features a branch target buffer that provides the instruction cache with the target address of branch type instructions. The branch target buffer is implemented as a 128-entry, direct mapped cache.
Once a branch is stored in the BTB, the history bits are updated upon every execution of the branch as shown in Figure 5-2. BTB Control 5.2.1 Disabling/Enabling The BTB is always disabled with Reset. Software enables the BTB through the Control Register bit[11] in coprocessor 15 (see Section 7.2.2).
Section 7.2.7, “Register 7: Cache Functions” on page 7-9. 3. The BTB is invalidated when the Process ID Register is written. 4. The BTB is invalidated when the instruction cache is invalidated via CP15, register 7 functions.
The Intel® XScale™ core data cache enhances performance by reducing the number of data accesses to and from external memory. There are two data cache structures in the Intel® XScale™ core, a 32 Kbyte data cache and a 2 Kbyte mini-data cache. An eight entry write buffer and a four entry fill buffer are also implemented to decouple the Intel®...
The mini-data cache is virtually addressed and virtually tagged and supports the same caching policies as the data cache. However, lines cannot be locked into the mini-data cache.
6.1.3 Write Buffer and Fill Buffer Overview The Intel® XScale™ core employs an eight entry write buffer, each entry containing 16 bytes. Stores to external memory are first placed in the write buffer and subsequently taken out when the bus is available.
If so, the current request is placed in the pending buffer and waits until the previously requested fill completes, after which it accesses the cache again to obtain the requested data and returns it to the destination register.
For the PXA255 processor, the size of a data load depends also on the memory bank addressed in the access. For example, all 32-bit wide SDRAM reads are bursts of 4 words. All loads from this SDRAM generate a read of 4 words, even though, for uncacheable loads, only the object the core requests is used.
6.2.3.4 Write-Back Versus Write-Through The Intel® XScale™ core supports write-back caching or write-through caching, controlled through the MMU page attributes. When write-through caching is specified, all store operations are written to external memory even if the access hits the cache. This feature keeps the external memory coherent with the cache, i.e., no dirty bits are set for this region of memory in the data/...
This allocation evicts any dirty cache data back to external memory. Example 6-2 on page 6-9 shows how the data cache can be cleaned.
It must reside in a page that is marked as mini Data Cache cacheable (see Section 2.3.2). The time it takes to execute a global clean operation depends on the number of dirty lines in cache.
The data cache can only be unlocked by using the global unlock command. See Table 7-14, “Cache Lockdown Functions” on page 7-11. The invalidate-entry command should not be issued to a locked line as this will render the line useless until a global unlock is issued.
; in R1 to the next cache line.
DRAIN
SUBS R0, R0, #1 ; Decrement loop count
BNE LOOP1
; Turn off data cache locking
DRAIN
MOV R2, #0x0
MCR P15,0,R2,C9,C2,0 ; Take the data cache out of lock mode.
CPWAIT
For this reason, system software should ensure the memory address used in the PLD is correct. If this cannot be ascertained, replace the PLD with an LDR instruction that targets a scratch register.
Before locking, the programmer must ensure that no part of the target data range is already resident in the cache. The Intel® XScale™ core will not refetch such data, which will result in it not being locked into the cache. If there is any doubt as to the location of the targeted memory data, the cache should be cleaned and invalidated to prevent this scenario.
The write buffer and fill buffer support a drain operation, such that before the next instruction executes, all the Intel® XScale™ core data requests to external memory have completed. See Table 7-12, “Cache Functions” on page 7-9 for the exact command.
7-2. Any access to CP14 in user mode will cause an Undefined Instruction exception. Coprocessors CP15 and CP14 on the Intel® XScale™ core do not support access via CDP, MRRC, or MCRR instructions. An attempt to access these coprocessors with these instructions will result in an Undefined Instruction exception.
The Cache Type Register is selected when opcode_2=1 and describes the cache configuration of the Intel® XScale™ core. These values are device specific to the PXA255 processor; for the full set of potential values, consult the ARM* Architecture Reference Manual.
Register 1 is made up of two registers: one that is compliant with ARM* Version 5 and is referred to by opcode_2 = 0x0, and another that is specific to the Intel® XScale™ core and is referred to by opcode_2 = 0x1. The latter is known as the Auxiliary Control Register.
The configuration of the mini-data cache must be set up before any data access is made that may be cached in the mini-data cache. Once data is cached, software must ensure that the mini-data cache has been cleaned and invalidated before the mini-data cache attributes can be changed.
Read / Write accessed when a data abort occurred Status - Used along with the X-bit above to determine the Read / Write type of cycle type that generated the exception. See “Event Architecture” on page 2-11
The Drain Write Buffer function not only drains the write buffer but also drains the fill buffer. The Intel® XScale™ core does not check permissions on addresses supplied for cache or TLB functions. Because only privileged software may execute these functions, full accessibility is assumed.
To invalidate the TLBs the commands below are required. All operations defined in Table 7-13 work regardless of whether the cache is enabled or disabled. This register is write-only. Reads from this register, as with an MRC, have an undefined effect.
Bits   Access                                 Description
31:1   Read-unpredictable / Write-as-Zero     Reserved
0      Read-unpredictable / Write             Data Cache Lock Mode (L): 0 = No locking occurs; 1 = Any fill into the data cache while this bit is set gets locked in
7.2.11 Register 13: Process ID The Intel® XScale™ core supports the remapping of virtual addresses through a Process ID (PID) register. This remapping occurs before the instruction cache, instruction TLB, data cache and data TLB are accessed. The PID register controls when virtual addresses are remapped and to what value.
IBCR1), one data breakpoint address register (DBR0), one configurable data mask/address register (DBR1), and one data breakpoint control register (DBCON). The Intel® XScale™ core also supports a 2K byte mini instruction cache for debugging and a 256 entry trace buffer that records program execution information.
OS has to maintain a list of what processes are modifying CP0 and their associated state. A system programmer making this OS change should include code for coprocessors CP0 through CP13. Although the PXA255 processor only supports CP0, future products may implement additional coprocessor functionality from CP1-CP13.
7-25. To enter any of these modes, write the appropriate data to CP14, register 7 (PWRMODE). Software may read this register, but since software only runs during ACTIVE mode, it will always read zeroes from the M field.
10 through 13 support a 256 entry trace buffer. Register 14 and 15 are the debug link register and debug SPSR (saved program status register). These registers are explained in more detail in Chapter 10, “Software Debug”. Opcode_2 and CRm must be zero.
Performance Monitoring This chapter describes the performance monitoring facility of the Intel® XScale™ core. The events that are monitored provide performance information for compiler writers, system application developers and software programmers. Overview The Intel® XScale™ core hardware provides two 32-bit performance counters that allow two unique events to be monitored simultaneously.
2 cycles it takes to generate an overflow interrupt. Performance Monitor Control Register (PMNC) The performance monitor control register (PMNC) is a coprocessor register that: • controls which events PMN0 and PMN1 will monitor...
PMNC register. The interrupt will remain asserted until software clears the overflow flag by writing a one to the flag that is set. Note that the PXA255 processor Interrupt Controller and the CPSR interrupt bit must be enabled in order for software to receive the interrupt.
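The write-one-to-clear behavior of the overflow flags can be sketched in Python as below. The function name is hypothetical, and the flags are modeled as an abstract bit field; their actual bit positions within PMNC are not given in this excerpt and are deliberately left out.

```python
def write_one_to_clear(flags: int, written: int) -> int:
    """Model of a write-one-to-clear flag field.

    Each 1 bit in `written` clears the matching flag; each 0 bit leaves
    the matching flag unchanged.
    """
    return flags & ~written
```

A handler would typically write back exactly the flags it has just serviced, so that a flag raised in the meantime is not lost.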
PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time due to the instruction fetch buffers, even when the memory page is marked as uncached.
The average number of cycles the processor stalled waiting for an instruction fetch from external memory to return. This is calculated by dividing PMN0 by PMN1. If the average is high then the Intel® XScale™ core may be starved of memory access due to other bus traffic. •...
Performance Monitoring is high, possibly due to starvation, these Data Cache buffers will become full. This performance monitoring mode is provided to see if the Intel® XScale™ core is being starved of the bus external to the Intel® XScale™ core.
In this example, the events selected with the Instruction Cache Efficiency mode are monitored and CCNT is used to measure total execution time. Sampling time ends when PMN0 overflows, which generates an IRQ interrupt.
Instruction Cache miss-rate = 100 * PMN1/PMN0 = 5%
CPI = (CCNT + 2^32)/Number of instructions executed = 2.4 cycles/instruction
In the contrived example above, the instruction cache had a miss-rate of 5% and CPI was 2.4.
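The two formulas above translate directly into code. The following Python sketch uses hypothetical function names; the `ccnt_overflows` parameter generalizes the single 2^32 term in the CPI formula to any number of observed CCNT overflows, which is an assumption beyond the one-overflow case shown.

```python
def icache_miss_rate_percent(pmn0: int, pmn1: int) -> float:
    """Instruction cache miss-rate = 100 * PMN1 / PMN0."""
    return 100.0 * pmn1 / pmn0

def cycles_per_instruction(ccnt: int, instructions: int,
                           ccnt_overflows: int = 0) -> float:
    """CPI = (CCNT + overflows * 2^32) / number of instructions executed."""
    return (ccnt + ccnt_overflows * 2**32) / instructions
```

With illustrative counter values (not taken from the manual) of PMN0 = 2000 and PMN1 = 100, the miss-rate works out to 5%.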
JTAG, an acronym for the Joint Test Action Group. The JTAG interface on the application processor can be used as a hardware interface for software debugging of PXA255 systems. This interface is described in Chapter 10, “Software Debug.”...
Idcode instruction is selected. If TCK is pulsed, the contents of the ID register are clocked out of TDO. If the boundary-scan interface is not to be used, then the nTRST pin may be tied permanently low or to the nRESET pin.
CAPTURE_DR state. While this instruction is in effect, all other IEEE 1149.1 test data registers have no effect on the operation of the system. Test data registers with both test and system functionality perform their system functions when this instruction is selected. Instruction code: 11111. Required.
This is to prevent a scan operation from disabling power to the device and/or resetting external components. The following pins are not part of the boundary-scan shift-register:
• PEXTAL
• PXTAL
• TEXTAL
JTAG reset (from forcing nTRST low or entering the Test Logic Reset state). The PXA255 256-pin PBGA package boundary scan pin order is shown in Figure 9-2 on page 9-6.
The high-order 4 bits of the ID register contain the version number of the silicon and change with each new revision. There is no parallel output from the ID register. The 32-bit device identification code is loaded into the ID register from its parallel inputs during the CAPTURE-DR state.
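Extracting the silicon version from a captured identification code is a simple shift and mask, sketched here in Python. The function name is hypothetical, and the sample value used in the usage note is arbitrary rather than a real PXA255 ID code.

```python
def silicon_version(idcode: int) -> int:
    """Return the high-order 4 bits (bits 31:28) of the 32-bit ID code."""
    return (idcode >> 28) & 0xF
```

For an arbitrary captured value such as 0xA0000001, this returns 0xA.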
This prevents a scan operation from turning off power to the application processor. For greater detail on the state machine and the public instructions, refer to IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture Document.
The TAP controller enters the Run-Test/Idle state between scan operations. The controller remains in this state as long as TMS is held low. In the Run-Test/Idle state, the runbist instruction is performed; the result is reported in the RUNBIST register. Instructions that do not call functions...
If TMS is held low on the rising edge of TCK, the controller enters the Pause-DR state. The instruction does not change while the TAP controller is in this state. All test data registers selected by the current instruction retain their previous value during this state.
The instruction does not change in this state. 9.5.11 Capture-IR State When the controller is in the Capture-IR state, the shift register contained in the instruction register loads the fixed value 0001 on the rising edge of TCK.
The instruction shifted into the instruction register is latched onto the parallel output from the shift- register path on the falling edge of TCK. Once latched, the new instruction becomes the current instruction. Test data registers selected by the current instruction retain their previous values.
If TMS is held high on the rising edge of TCK, the controller enters the Select-DR-Scan state. If TMS is held low on the rising edge of TCK, the controller enters the Run-Test/Idle state.
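The TMS-driven transitions described in this chapter follow the standard IEEE 1149.1 TAP controller state machine, which can be written as a lookup table. The Python sketch below is a software model only; the table and function names are hypothetical, and the transitions are those of the IEEE 1149.1 standard rather than anything specific to this processor.

```python
# state -> (next state if TMS = 0, next state if TMS = 1), sampled on each
# rising edge of TCK, per the standard IEEE 1149.1 state diagram.
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def tap_walk(state: str, tms_bits) -> str:
    """Apply a sequence of TMS values, one per TCK rising edge."""
    for tms in tms_bits:
        state = TAP_NEXT[state][1 if tms else 0]
    return state
```

A useful property of this table is that holding TMS high for five TCK cycles returns the controller to Test-Logic-Reset from any state.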
The debugger can then restart execution of the application. The external debug interface to the PXA255 processor is via the JTAG port. Further details on the JTAG interface can be found in Section 9, “Test”.
JTAG interface. This is to allow an external debugger to have access to the internal state of the processor. For the details of which bits can be accessed, see Table 10-8, Table 10-12, and Table 10-3.
A debug exception is generated before the instruction in the exception vector executes. Software running on the Intel® XScale™ core must set the Global Enable bit and the debugger must set the Halt Mode bit and the appropriate vector trap bit through JTAG to set up a non-reset vector trap.
Buffer”. 10.4 Debug Exceptions A debug exception causes the processor to re-direct execution to a debug event handling routine. The Intel® XScale™ core debug architecture defines the following debug exceptions: 1. instruction breakpoint 2. data breakpoint 3. software breakpoint 4. external debug break 5.
Section 10.13, “Downloading Code into the Instruction Cache” on page 10-30 for details about downloading code into the instruction cache. During Halt mode, software running on the Intel® XScale™ core cannot access DCSR, or any of the hardware breakpoint registers, unless the processor is in Special Debug State (SDS), described below.
The following debug exceptions cause data aborts:
• data breakpoint
• external debug break
• trace-buffer full break
When the vector table is relocated (CP15 Control Register[13] = 1), the debug vector is relocated to 0xFFFF_0000.
10.5 HW Breakpoint Resources The Intel® XScale™ core debug architecture defines two instruction and two data breakpoint registers, denoted IBCR0, IBCR1, DBR0, and DBR1. The instruction and data address breakpoint registers are 32-bit registers. The instruction breakpoint causes a break before execution of the target instruction.
Single step execution is accomplished using the instruction breakpoint registers and must be completely handled in software (either on the host or by the debug handler). 10.5.2 Data Breakpoints The Intel® XScale™ core debug architecture defines two data breakpoint registers (DBR0, DBR1). The format of the registers is shown in Table 10-6.
On unaligned memory accesses, breakpoint address comparison is done on a word-aligned address (aligned down to word boundary).
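Aligning an address down to a word boundary clears its two low-order bits. The following Python sketch shows the arithmetic; the function name is hypothetical.

```python
def word_aligned(address: int) -> int:
    """Align a 32-bit address down to a 4-byte word boundary."""
    return address & 0xFFFFFFFC
```

An unaligned access at 0x1003 is therefore compared as 0x1000 in this model.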
All of the bits in the TXRXCTRL register are placed such that they can be read directly into the CC flags in the CPSR with an MRC (with Rd = PC). The subsequent instruction can then conditionally execute based on the updated CC value.
Before the high-speed download can start, both the debugger and debug handler must be synchronized, such that the debug handler is executing a routine that supports the high-speed download.
Table 10-10. High-Speed Download Handshaking States Debugger Actions Debugger wants to transfer code into the Intel® XScale™ core system memory. Prior to starting download, the debugger must poll the RR bit until it is clear. Once the RR bit is clear, indicating the debug handler is ready, the debugger starts the download.
RX and the data is ready for the debug handler to read.
loop: mrc p14, 0, r15, c14, c0, 0 # read the handshaking bit in TXRXCTRL
JTAG), handshaking is required to prevent the debugger from writing new data to the register before the debug handler reads the previous data out. The handshaking is described in Section 10, “RX Register Ready Bit (RR)”.
Debug Control and Status Register (DCSR). The debugger can only modify certain bits through JTAG, but can read the entire register. The SELDCSR instruction also allows the debugger to generate an external debug break.
Status Register (DCSR)” on page 10-3 are updated. An external host and the debug handler running on the Intel® XScale™ core must synchronize access to the DCSR. If one side writes the DCSR at the same time the other side reads the DCSR, the results are unpredictable.
Debug mode will not be entered until all processor activity has ceased in an orderly fashion.
10.10.2.3 DBG.DCSR
The DCSR is updated with the value loaded into DBG.DCSR following an Update_DR. Only bits specified as writable by JTAG in Table 10-3 are updated.
A ‘1’ captured in DBG_SR[0] indicates the captured TX data is valid. After doing a Capture_DR, the debugger must place the JTAG state machine in the Shift_DR state to guarantee that a debugger read clears TXRXCTRL[28].
A Capture_DR loads TXRXCTRL[31] into DBG_SR[0]. The other bits in DBG_SR are loaded as shown in Figure 10-3. The captured data is scanned out during the Shift_DR state. While polling TXRXCTRL[31], incorrectly setting DBG_SR[35] or DBG_SR[1] will cause unpredictable behavior following an Update_DR.
The bits in the DBGRX data register (Figure 10-5) are used by the debugger to send data to the processor. The data register also contains a bit to flush previously written data and a high-speed download flag.
DBG.RX is written into the RX register based on the output of the RX Write Logic. Any data that needs to be sent from the debugger to the processor must be loaded into DBG.RX with DBG.V set to 1. DBG.RX is loaded from DBG_SR[34:3] when the JTAG enters the Update_DR state.
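The DBG_SR[34:3] field can be illustrated with a small C sketch. It assumes the 36-bit shift register (bits 35 down to 0) is held in the low bits of a 64-bit integer; the function names are hypothetical, and the control bits outside [34:3] are left at zero rather than modelled.

```c
#include <stdint.h>

/* Extract the 32-bit DBG.RX payload from DBG_SR[34:3]. */
uint32_t dbg_rx_from_sr(uint64_t dbg_sr)
{
    return (uint32_t)((dbg_sr >> 3) & UINT64_C(0xFFFFFFFF));
}

/* Place a 32-bit data word into DBG_SR[34:3]; all other bits are 0. */
uint64_t dbg_sr_with_rx(uint32_t data)
{
    return (uint64_t)data << 3;
}
```

A host-side tool would OR the appropriate control bits into the value returned by `dbg_sr_with_rx` before scanning it in; which bits those are should be taken from the DBGRX register figure, not from this sketch.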
DBG.D is provided for use during high speed download. This bit is written directly to TXRXCTRL[29]. The debugger sets DBG.D when downloading a block of code or data to the Intel® XScale™ core system memory. The debug handler then uses TXRXCTRL[29] as a branch flag to determine the end of the loop.
Read/Write target address for corresponding entry in trace buffer
The two checkpoint registers (CHKPT0, CHKPT1) on the Intel® XScale™ core provide the debugger with two reference addresses to use for re-constructing the trace history. When the trace buffer is enabled, reading and writing to either checkpoint register has unpredictable results.
10.11.2 Trace Buffer Usage
The Intel® XScale™ core trace buffer is 256 bytes in length. The first byte read from the buffer represents the oldest trace history information in the buffer. The last (256th) byte read represents the most recent entry in the buffer. The last byte read from the buffer will always be a message byte.
Bytes”). If the first non-zero entry is any other type of message byte, then these 0’s indicate that the trace buffer has not wrapped around and that first non-zero entry is the start of the trace.
Exception Format: M = Message Type Bit, VVV = exception vector[4:2], CCCC = Incremental Word Count
Non-exception Format: MMMM = Message Type Bits, CCCC = Incremental Word Count
Table 10-19 shows all of the possible trace messages.
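The two message byte layouts can be sketched as a C decoder. The field widths (1 + 3 + 4 bits for the exception format, 4 + 4 bits for the non-exception format) come from the legend above. Treating a clear top bit as the marker for the exception format is an assumption of this sketch, and the struct and function names are hypothetical; Table 10-19 remains the authority on the actual encodings.

```c
#include <stdbool.h>
#include <stdint.h>

struct trace_msg {
    bool    is_exception;
    uint8_t type;        /* MMMM for non-exception bytes, 0 otherwise   */
    uint8_t vector_addr; /* exception vector address rebuilt from VVV   */
    uint8_t count;       /* CCCC, the incremental word count            */
};

struct trace_msg decode_trace_byte(uint8_t b)
{
    struct trace_msg m;
    m.count = b & 0x0F;
    /* Assumption: bit 7 clear selects the exception format. */
    m.is_exception = (b & 0x80) == 0;
    if (m.is_exception) {
        m.type = 0;
        /* VVV holds vector[4:2], so shift left by 2 to get the address. */
        m.vector_addr = (uint8_t)(((b >> 4) & 0x07) << 2);
    } else {
        m.type = (uint8_t)((b >> 4) & 0x0F);
        m.vector_addr = 0;
    }
    return m;
}
```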
Non-exception Message Byte
Non-exception message bytes are used for direct branches, indirect branches, and rollovers. In a non-exception message byte, the 4-bit message type field (MMMM) specifies the type of message (refer to Table 10-19).
MSB of the target address is read out first; the LSB is the fourth byte read out; and the indirect branch message byte is the fifth byte read out. The byte organization of the indirect branch message is shown in Figure 10-8.
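Reassembling the target address from the bytes in read-out order can be sketched in C as below. The function name and parameter names are hypothetical; the byte order (most significant byte first, least significant byte fourth) is the one stated above.

```c
#include <stdint.h>

/* Rebuild the 32-bit branch target from the four address bytes in the
   order they are read out of the trace buffer: b0 is the MSB, b3 the LSB.
   The fifth byte read (the message byte itself) is not part of the address. */
uint32_t indirect_branch_target(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return ((uint32_t)b0 << 24) |
           ((uint32_t)b1 << 16) |
           ((uint32_t)b2 << 8)  |
            (uint32_t)b3;
}
```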
The Intel® XScale™ core supports loading either instruction cache during reset and during program execution. Loading the instruction cache during normal program execution requires a strict handshaking protocol between software running on the Intel® XScale™ core and the external host.
All LDIC functions and data consist of 33-bit packets which are scanned into LDIC_SR1 during the Shift_DR state. Update_DR parallel loads LDIC_SR1 into LDIC_REG which is then synchronized with the Intel® XScale™ core clock and loaded into the LDIC_SR2. Once data is loaded into LDIC_SR2, the LDIC State Machine turns on and serially shifts the contents of LDIC_SR2 to the instruction cache.
10.13.3 LDIC Cache Functions
The Intel® XScale™ core supports four cache functions that can be executed through JTAG. Two functions allow an external host to download code into the main instruction cache or the mini instruction cache through JTAG. Two additional functions are supported to allow lines to be invalidated in the instruction cache.
• LDIC mode: active when LDIC JTAG instruction is loaded in the JTAG IR; prevents the mini instruction cache and the main instruction cache from being invalidated during reset.
NOTE: In Figure 10-11, hold_rst is a signal that gets set and cleared through JTAG. When the JTAG IR contains the SELDCSR instruction, the hold_rst signal is set to the value scanned into DBG_SR[1].
The Halt Mode bit must remain set to prevent the instruction cache from being invalidated.
9. When hold_rst is cleared, internal reset is de-asserted, and the processor executes the reset vector at address 0.
In this last scenario, the mini instruction cache does not get invalidated by reset, since the processor is in Halt Mode. This scenario is described in more detail in this section. The last scenario described above is shown in Figure 10-12.
4) Place the LDIC JTAG instruction in the JTAG IR, then proceed with the normal code download, using the Invalidate IC Line function before loading each line. This requires 10 packets to be downloaded per cache line instead of the 9 packets as described in Section 10.13.3.
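The cost of the extra packet per line can be estimated with a short C sketch. The 9 and 10 packets per line and the 33-bit packet size are taken from this chapter; the 32-byte line size is an assumption derived from the 8-word cache line fill mentioned in the optimization guide, and the function names are hypothetical.

```c
#include <stdint.h>

enum {
    LDIC_PACKET_BITS            = 33, /* bits scanned per packet        */
    LDIC_PACKETS_PER_LINE       = 9,  /* plain load of one cache line   */
    LDIC_PACKETS_PER_LINE_INVAL = 10, /* invalidate the line, then load */
    LDIC_LINE_BYTES             = 32  /* assumed: 8 words per line      */
};

/* Packets needed to download `code_bytes` of code, rounded up to whole
   cache lines. */
uint32_t ldic_packets(uint32_t code_bytes, int invalidate_first)
{
    uint32_t lines = (code_bytes + LDIC_LINE_BYTES - 1) / LDIC_LINE_BYTES;
    return lines * (uint32_t)(invalidate_first ? LDIC_PACKETS_PER_LINE_INVAL
                                               : LDIC_PACKETS_PER_LINE);
}

/* Payload bits shifted through the data register for that many packets. */
uint32_t ldic_shift_bits(uint32_t packets)
{
    return packets * LDIC_PACKET_BITS;
}
```

Under these assumptions, filling a 2 KB mini instruction cache takes 64 lines: 576 packets without invalidation and 640 packets with it.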
The description in this section focuses on using a debug handler running on the Intel® XScale™ core to synchronize with the external host, but the details apply for any application that is running while code is dynamically downloaded.
In a very simple debug handler stub, the above parts may form the complete handler downloaded during reset (with some handler entry and exit code). When a debug exception occurs, routines can be downloaded as necessary. This allows the entire handler to be dynamic.
JTAG LDIC function. Code downloaded into the mini instruction cache is essentially locked - it cannot be overwritten by application code running on the Intel® XScale™ core. It is not locked against code downloaded through the JTAG LDIC functions.
One possibility is to set up vector traps on the non-reset exception vectors. These vector locations can then be used to extend the reset vector.
(non RRX) and MCR/MRC instructions, as a temporary scratch register.
• The following instructions must not be executed in Debug Mode as they will result in unpredictable behavior:
LDR w/ Rd=PC
LDR w/ RRX addressing mode
10.14.2.3 Dynamic Debug Handler
On the Intel® XScale™ core, the debug handler and override vector tables may reside in the 2 KB mini instruction cache, separate from the main instruction cache. A “static” Debug Handler is downloaded during reset. This is the base handler code, necessary to do common operations such as handler entry/exit, parse commands from the debugger, read/write ARM* registers, read/write memory, etc.
Using this assumption, the debugger does not have to poll RR to see whether the handler has read the previous data - it assumes the previous data has been consumed and immediately starts scanning in the next data word.
2. turn off all breakpoints;
3. invalidate the mini instruction cache;
4. invalidate the main instruction cache;
5. invalidate the BTB;
These actions ensure that the application program executes correctly after the debugger has been disconnected.
However, in this specific case, the overflow flag does not get set, so the debugger is unaware that the download was not successful.
The timings in this section are specific to the PXA255 processor, and how it implements the ARM* v5TE architecture. This is not a summary of all possible optimizations nor is it an explanation of the ARM* v5TE instruction set.
The load and store addressing modes implemented in the Intel® XScale™ core do not add to the instruction latency numbers. The following section explains how to read these tables.
If the next instruction needs to use the result of the data processing for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed.
If the next instruction needs to use the result of the MRA for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed.
11.2.5 Saturated Arithmetic Instructions
1 for writeback of base
STRB 1 for writeback of base
STRBT 1 for writeback of base
STRD 1 for writeback of base
STRH 1 for writeback of base
STRT 1 for writeback of base
Minimum Interrupt Latency is defined as the minimum number of cycles from the assertion of any interrupt signal (IRQ or FIQ) to the execution of the instruction at the vector for that interrupt. An active system responding to an interrupt will typically depend predominantly on the PXA255 processor’s internal & external bus activity.
Intel® XScale™ core architecture. It is written for developers who are optimizing compilers or performance analysis tools for the Intel® XScale™ core based processors. It can also be used by application developers to obtain the best performance from their assembly language code. The optimizations presented in this chapter are based on the Intel®...
A.2.1.1. Number of Pipeline Stages
The Intel® XScale™ core has a longer pipeline (7 stages versus 5 stages for StrongARM*) which operates at a much higher frequency than its predecessors do. This allows for greater overall performance. The longer Intel® XScale™ core pipeline has several negative consequences, however: •...
A.2.1.5. Use of Bypassing
The Intel® XScale™ core pipeline makes extensive use of bypassing to minimize data hazards. Bypassing allows results forwarding from multiple sources, eliminating the need to stall the pipeline.
The progress of an instruction can stall anywhere in the pipeline. Several pipestages may stall for various reasons. It is important to understand when and how hazards occur in the Intel® XScale™ core pipeline. Performance degradation could be significant if care is not taken to minimize pipeline stalls.
ALU calculation - the ALU performs arithmetic and logic operations, as required for data processing instructions and load/store index calculations.
• Determine conditional instruction execution - The instruction’s condition is compared to the CPSR prior to execution of each instruction. Any instruction with a false condition is...
Multiply/Multiply Accumulate (MAC) Pipeline
The Multiply-Accumulate (MAC) unit executes all multiply and multiply-accumulate instructions supported by the Intel® XScale™ core. The MAC implements the 40-bit Intel® XScale™ core accumulator register acc0 and handles the instructions that transfer its value to and from general-purpose ARM* registers.
A.3.1 Conditional Instructions
The Intel® XScale™ core architecture provides the ability to execute instructions conditionally. This feature combined with the ability of the Intel® XScale™ core instructions to modify the condition codes makes possible a wide array of optimizations.
A.3.1.1.
The code generated above takes three cycles to execute the else part and four cycles for the if-part assuming best case conditions and no branch misprediction penalties. In the case of the Intel® XScale™ core, a branch misprediction incurs a penalty of four cycles. If the branch is mispredicted 50% of the time, and if we assume that both the if-part and the else-part are equally likely to be taken, on average the code above takes 5.5 cycles to execute.
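The 5.5-cycle figure can be reproduced with a small expected-value calculation. The C sketch below is only a model of the arithmetic in the paragraph above; the function and parameter names are hypothetical.

```c
/* Expected cycles for a branchy if/else: the probability-weighted path
   length plus the probability-weighted misprediction penalty. */
double expected_cycles(double if_cycles, double else_cycles,
                       double p_if, double p_mispredict, double penalty)
{
    return p_if * if_cycles
         + (1.0 - p_if) * else_cycles
         + p_mispredict * penalty;
}
```

With the numbers from the text, `expected_cycles(4.0, 3.0, 0.5, 0.5, 4.0)` evaluates to 0.5 × 4 + 0.5 × 3 + 0.5 × 4 = 5.5 cycles.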
If we were to use the Intel® XScale™ core to execute instructions conditionally, the code generated for the above if-else statement is:
    cmp   r0, #10
    movgt r0, #0
    movle r0, #1
The above code segment would not incur any branch misprediction penalties and would take three cycles to execute assuming best case conditions.
The use of conditional instructions in the above fashion improves performance by minimizing the number of branches, thereby minimizing the penalties caused by branch mispredictions. This approach also reduces the utilization of branch prediction resources.
A.3.2 Bit Field Manipulation
The Intel® XScale™ core shift and logical operations provide a useful way of manipulating bit fields. Bit field operations can be optimized as follows:
;Set the bit number specified by r1 in register r0...
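For readers working in C, the same shift-and-logical idea can be sketched as below. These helpers are illustrative equivalents, not the manual's assembly sequences; the names are hypothetical, the clear operation is added as the natural complement of the set operation, and the bit number is reduced modulo 32 purely to keep the C shift well defined.

```c
#include <stdint.h>

/* Set bit number `bit` in `value` using one shift and one OR. */
uint32_t set_bit(uint32_t value, unsigned bit)
{
    return value | (UINT32_C(1) << (bit & 31u));
}

/* Clear bit number `bit` in `value` using one shift and one AND-NOT. */
uint32_t clear_bit(uint32_t value, unsigned bit)
{
    return value & ~(UINT32_C(1) << (bit & 31u));
}
```

A compiler targeting ARM can typically fold the shift into the second operand of the logical instruction, which is the property this section of the guide relies on.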
A.3.5 Effective Use of Addressing Modes
The Intel® XScale™ core provides a variety of addressing modes that make indexing an array of objects highly efficient. For a detailed description of these addressing modes please refer to the ARM* Architecture Reference Manual.
A.4.1.1. Cache Miss Cost
The Intel® XScale™ core performance is highly dependent on reducing the cache miss rate. Note that this cache miss penalty becomes significant when the core is running much faster than external memory. Executing non-cached instructions severely curtails the processor's performance in this case and it is very important to do everything possible to minimize cache misses.
A.4.2 Data and Mini Cache
The Intel® XScale™ core allows the user to define memory regions whose cache policies can be set by the user (see Section 6.2.3, “Cache Policies”). Supported policies and configurations are: •...
Application performance can be improved by converting a part of the cache into on-chip RAM and allocating frequently used variables to it. Due to the Intel® XScale™ core round-robin replacement policy, all data will eventually be evicted. Therefore to prevent critical or frequently used data from being evicted it should be allocated to on-chip RAM.
In this case if tdata[] is not aligned to a cache line, then the prefetch using the address of tdata[i+1].ia may not include element id. If the array was aligned on a cache line + 12 bytes, then the prefetch would have to be placed on &tdata[i+1].id.
A.4.2.7. Literal Pools
The Intel® XScale™ core does not have a single instruction that can move all literals (a constant or address) to a register. One technique to load registers with literals in the Intel® XScale™ core is by loading the literal from a memory location that has been initialized with the constant or address.
Scheduling the prefetch instruction requires some understanding of the system latency times and system resources which affect when to use the prefetch instruction. For the PXA255 processor a cache line fill of 8 words from external memory will take more than 10 memory clocks, depending on external RAM speed and system timing configuration.
A.4.4.5. Bandwidth Limitations
Overuse of prefetches can usurp resources and degrade performance. This happens because once the bus traffic requests exceed the system resource capacity, the processor stalls. The Intel® XScale™ core data transfer resources are:
4 fill buffers
4 pending buffers...
Similarly, rearranging sections of data structures so that sections often written fit in the same half cache line [16 bytes for the Intel® XScale™ core] can reduce cache eviction write-backs. On a global scale, techniques such as array merging can enhance the spatial locality of the data.
This problem can be resolved by prefetch unrolling. For example consider:
for(i=0; i<NMAX; i++)
{
    prefetch(data[i+2]);
    sum += data[i];
}
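One way to picture prefetch unrolling is to issue a single prefetch per cache line and consume a whole line of elements per outer iteration. The C sketch below is an illustration under stated assumptions, not the manual's own unrolled loop: it assumes 32-bit elements, an 8-word cache line, line-aligned data, and a GCC or Clang compiler for `__builtin_prefetch` (other compilers fall back to a no-op). The macro and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p)) /* no prefetch hint available */
#endif

#define WORDS_PER_LINE ((size_t)8) /* assumed: 8 words per cache line */

long sum_with_line_prefetch(const int32_t *data, size_t n)
{
    long sum = 0;
    size_t i = 0;

    /* Unrolled part: one prefetch for the next line, then one full line
       of work. The loop condition guarantees i + WORDS_PER_LINE <= n, so
       the prefetched address is at most one past the end of the array. */
    for (; i + WORDS_PER_LINE <= n; i += WORDS_PER_LINE) {
        PREFETCH(&data[i + WORDS_PER_LINE]);
        for (size_t j = 0; j < WORDS_PER_LINE; j++)
            sum += data[i + j];
    }

    /* Remainder that does not fill a whole line. */
    for (; i < n; i++)
        sum += data[i];

    return sum;
}
```

Compared with the per-element loop above, this spaces the prefetches further apart, so each one has more computation to overlap with. How far ahead to prefetch still has to be tuned against the measured memory latency.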
Note the order reversal of the prefetches in relationship to the usage. If there is a cache conflict and data is evicted from the cache then only the data from the first prefetch is lost.
However, the load ties up the receiving register until the data can be used. For example:
    ldr r2, [r0]
    ; Process code { not yet cached latency > 30 core clocks }
    r1, r1, r2
Scheduling Loads
On the Intel® XScale™ core, an LDR instruction has a result latency of 3 cycles assuming the data being loaded is in the data cache. If the instruction after the LDR needs to use the result of the load, then it would stall for 2 cycles.
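The relationship between result latency and stall cycles can be captured in a simple model. The C sketch below assumes a single-issue pipeline in which each independent instruction placed between producer and consumer occupies exactly one cycle; the function name is hypothetical and the model ignores every other source of stalls.

```c
/* Stall cycles seen by a consumer that follows its producer after
   `independent_between` unrelated one-cycle instructions. A result
   latency of 1 means the very next instruction can use the result. */
unsigned stall_cycles(unsigned result_latency, unsigned independent_between)
{
    unsigned needed = result_latency > 0 ? result_latency - 1 : 0;
    return needed > independent_between ? needed - independent_between : 0;
}
```

For a cached LDR with a result latency of 3, `stall_cycles(3, 0)` gives the 2-cycle stall described above, and scheduling two independent instructions after the load brings it to zero.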
    LSL #2
    r9, r9, #0xf
    r8, r6, r8
    r6, [sp], #4
    r8, r8, #4
    r8, r8, #0xf
    r1, r6, r7
    r3, r6, r2
    ; The value in register r6 is not used after this
; The value in register r6 is not used after this
The Intel® XScale™ core has 4 fill-buffers that are used to fetch data from external memory when a data-cache miss occurs. The Intel® XScale™ core stalls when all fill buffers are in use. This happens when more than 4 loads are outstanding and are being fetched from memory.
Similarly, the code sequence shown below takes 5 cycles to complete.
    r0, {r2, r3}
    r1, r1, #1
The alternative version which is shown below would only take 3 cycles to complete.
    strd r2, [r0]
    r1, r1, #1
A.5.2 Scheduling Data Processing Instructions
Most Intel® XScale™ core data processing instructions have a result latency of 1 cycle. This means that the current instruction is able to use the result from the previous data processing instruction. However, the result latency is 2 cycles if the current instruction needs to use the result of the previous data processing instruction for a shift by immediate.
Similarly, the code shown below would incur a 2 cycle penalty due to the 3-cycle result latency for the second destination register.
    mra r6, r7, acc0
    r1, r7
    r0, r6
    r2, r2, #1
The MRS instruction has an issue latency of 1 cycle and a result latency of 2 cycles. The MSR instruction has an issue latency of 2 cycles (6 if updating the mode bits) and a result latency of 1 cycle.
Optimizing for smaller code size will, in general, lower the performance of your application. These are some techniques for optimizing for code size using the Intel® XScale™ core instruction set. Many optimizations mentioned in the previous chapters improve the performance of ARM* code.
32-bit ARM* code. However, in some unusual cases where Instruction Cache size is a significant influence, being able to hold more Thumb instructions in cache may aid performance. Whatever the performance outcome, Thumb coding significantly reduces code size.