Summary of Contents for Intel PXA270
Page 1
Intel® PXA27x Processor Family Optimization Guide April, 2004 Order Number: 280004-001...
Page 2
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Page 4
Optimizing Arbiter Settings
3.5.2.1 Arbiter Functionality
3.5.2.2 Determining the Optimal Weights for Clients
3.5.2.3 Taking Advantage of Bus Parking
3.5.2.4 Dynamic Adaptation of Weights
3.5.3 Usage of DMA
3.5.4 Peripheral Bus Split Transactions
Page 5
Contents
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
Introduction
General Optimization Techniques
4.2.1 Conditional Instructions and Loop Control
4.2.2 Program Flow and Branch Instructions
4.2.3 Optimizing Complex Expressions
4.2.3.1 Bit Field Manipulation
4.2.4 Optimizing the Use of Immediate Values...
Page 7
Data Cache and Buffer Behavior when X = 0
Data Cache and Buffer Behavior when X = 1
Data Cache and Buffer operation comparison for Intel® SA-1110 and Intel XScale® Microarchitecture, X=0
Sample LCD Configurations with Latency and Peak Bandwidth Requirements
Memory to Memory Performance Using DMA for Different Memories and Frequencies...
Page 8
Resource Availability Delay for the Multiplier Pipeline
4-22 Resource Availability Delay for the Memory Pipeline
4-23 Resource Availability Delay for the Coprocessor Interface Pipeline
Power Modes and Typical Power Consumption Summary
Wireless Intel Speedstep® technology for ultra-low-power, Intel® Wireless MMX™ technology and up to 624 MHz for advanced multimedia capabilities, and Intel® Quick Capture Interface to give customers the ability to capture high quality images and video.
1.2.1 Intel XScale® Microarchitecture and Intel XScale® core The Intel XScale® Microarchitecture is based on a core that is ARM* version 5TE compliant. The microarchitecture surrounds the core with instruction and data memory management units; instruction, data, and mini-data caches; write, fill, pend, and branch-target buffers; power management, performance monitoring, debug, and JTAG units;...
This coprocessor, characterized by a 64-bit Single Instruction Multiple Data (SIMD) architecture and compatibility with the integer functionality of the Intel® MMX™ technology and SSE instruction sets, is known by its Intel project name, Intel® Wireless MMX™ technology. The key features of this coprocessor are: •...
• Superset of existing Intel XScale® Microarchitecture media processing instructions
See the Intel® Wireless MMX™ technology Coprocessor EAS for more details.
1.2.4 Memory Architecture
1.2.4.1 Caches
There are two caches:
• Data cache – The PXA27x processor supports 32 Kbytes of data cache.
1.2.5.2 Peripheral Bus The peripheral bus is a single master bus. The bus master arbitrates between the Intel XScale® core and the DMA controller with a pre-defined priority scheme between them. The peripheral bus is used by the low-bandwidth peripherals; the peripheral bus runs at 26 MHz.
• Switchable clock source
• Functional clock gating
• Programmable frequency-change capability
121 GPIOs are available on the PXA271 processor, PXA272 processor, and PXA273 processor. The PXA270 processor only has 119 GPIOs bonded out.
Backward compatibility for user-mode applications is maintained with the earlier generations of StrongARM* and Intel XScale® Microarchitecture processors. Operating systems may require modifications to match the specific Intel XScale® Microarchitecture hardware features, and to take advantage of the performance enhancements added to this core.
Page 19
JTAG interface and a 256-entry trace buffer
• Integrated memory controller with support for SDRAM, flash memory, synchronous ROM, SRAM, variable latency I/O (VLIO) memory, PC card, and compact flash expansion memory
• Six power-management modes
The following sections discuss general pipeline characteristics. 2.2.1.1 Pipeline Organization The Intel XScale® Microarchitecture has a 7-stage pipeline operating at a higher frequency than its predecessors, allowing for greater overall performance. The Intel XScale® Microarchitecture single-issue superpipeline consists of a main execution pipeline, a multiply-accumulate (MAC) pipeline, and a memory access pipeline.
The Intel XScale® Microarchitecture only preserves a weak processor consistency because instructions complete out of order (assuming no data dependencies exist). The Intel XScale® Microarchitecture can buffer up to four outstanding reads. If load operations miss the data cache, subsequent instructions complete independently. This operation is called a hit-under-miss operation.
The IFU is responsible for delivering instructions to the instruction decode (ID) pipestage. It delivers one instruction word each cycle (if possible) to the ID. The instruction could come from one of two sources: instruction cache or fetch buffers.
Branch target determinations – the X1 pipestage flushes all instructions in the previous pipestages and sends the branch target address to the BTB if a branch is mispredicted by the BTB. The flushing of these instructions restarts the pipeline.
4-word aligned address of the existing entry. The core can coalesce any of the four entries in the write buffer. The Intel XScale® Microarchitecture has a global coalesce disable bit located in the Control register (CP15, register 1, opcode_2=1).
The MAC can achieve throughput of one multiply per cycle when performing a 16-by-32-bit multiply. • ACC registers in the Intel XScale® Microarchitecture can be up to 64 bits in future implementations. Code should not be written to depend on the 40-bit nature of the current implementation.
RF stage. The register file is accessed for reads in the high phase of the clock and accessed for writes in the low phase. If data or resource hazards are detected, the Intel® Wireless MMX™ Technology stalls the Intel XScale® Microarchitecture. Note that control hazards are detected in the Intel XScale® Microarchitecture, and a flush signal is sent from the core to the Intel® Wireless MMX™ Technology.
2.3.1.3 X1 Stage The X1 stage is also known as the execution stage, which is where most instructions begin being executed.
Memory Pipeline Thread 2.3.3.1 D1 Stage In the D1 pipe stage, the Intel XScale® Microarchitecture provides a virtual address that is used to access the data cache. There is no logic inside the Intel® Wireless MMX™ Technology in the D1 pipe stage.
Because the PXA27x processor has a multi-transactional internal bus, there are latencies involved with accesses to and from the Intel XScale® core. The internal bus, also called the system bus, allows many internal operations to occur concurrently such as LCD, DMA controller and related data transfers.
If CCCR[A] is cleared, use the “Core PLL Output Frequencies for 13-MHz Crystal with CCCR[A] = 0” table in the Intel® PXA27x Processor Family Developer’s Manual when making the clock setting selections. If CCCR[A] is set, use the “Core PLL Output Frequencies for 13-MHz Crystal...
If the X bit for a descriptor is one, the C and B bits behave differently, as shown in Table 3-4. The load and store buffer behavior in Intel XScale® Microarchitecture is explained in Section 2.2.4.1.1, “Write Buffer Behavior” and Section 2.2.4.1.2, “Read Buffer Behavior”...
Allocate † Normally, "bufferable" writes can coalesce with previously buffered data in the same address range †† Refer to Intel XScale® Core Developer’s Manual and the Intel® PXA27x Processor Family Developer’s Manual for a description of this register. Note: The Intel XScale® Microarchitecture page-attributes are different than the Intel® StrongARM* SA-1110 Microprocessor (SA-1110).
Several techniques can be used to increase data cache performance, including optimizing the cache configuration and applying suitable programming techniques. This section offers a set of system-level optimization opportunities; however, program-level optimization techniques are equally important.
3.3.2.1 Cache Configuration The Intel XScale® Microarchitecture allows users to define memory regions whose cache policies can be set by the user. To support these various memory regions, the OS configures the page tables accordingly. The performance of application code depends on which cache policy is used for data objects. Guidance on when to use a particular policy is given below.
Due to the Intel XScale® Microarchitecture round robin replacement policy, all non-locked cache data will eventually be evicted. Therefore, to prevent critical or frequently used data from being evicted it can be allocated to on-chip RAM.
3.3.3 Optimizing TLB (Translation Lookaside Buffer) Usage The Intel XScale® Microarchitecture offers 32 entries for instruction and data TLBs. The TLB unit also offers a hardware page-table walk. This eliminates the need for using a software page table walk and software management of the TLBs.
The Intel XScale® Microarchitecture allows individual entries to be locked in the TLBs. Each locked TLB entry reduces the number of TLB entries available to hold other translation information. The entries one would expect to lock in the TLBs are those used during access to locked cache lines.
During a context switch, the state of the process has to be saved. For the PXA27x processor, the PCB (process control block) can be large in size due to additional registers for Intel® Wireless MMX™ Technology. In order to reduce context switch latency, the internal memory can be employed.
BPP is bits per pixel in physical memory, that is: 16 for 16 BPP, 32 for 18 BPP unpacked, 24 for 18 BPP packed (refer to the Intel® PXA27x Processor Family Developer’s Manual for more info).
Page 42
) = 72 system bus cycles Note that each LCD DMA channel has a 16-entry, 8-byte wide FIFO buffer to help deal with fluctuations in available bandwidth due to spikes in system activity.
SRAM. This method requires close coordination with the LCD controller to ensure that no artifacts are seen on the LCD. Refer to the LCD chapter in the Intel® PXA27x Processor Family Developer’s Manual for more information on reconfiguring the LCD.
This work can be off-loaded to the LCD controller by properly configuring the LCD controller. This has two advantages; first, the Intel XScale® core is not burdened with the processing, and second, the LCD bandwidth consumption is lowered by using the lower bit precision format.
The PXA27x processor arbiter features programmable “weights” for the LCD controller, DMA controller, and Intel XScale® Microarchitecture bus requests. In addition, the “park” bit can be set which causes the arbiter to grant the bus to a specific client whenever the bus is idle. These two features should be used to tune PXA27x processor to match your system bandwidth requirements.
3.5.2.2.3 Weight for Core A good method for setting the Intel XScale® core weight and the DMA controller weight is to determine the ratio of the bandwidth requirements of both. Once the ratio is determined the weights can be programmed with that same ratio. For instance, if the Intel XScale® core requires twice the bandwidth of the DMA controller, the DMA weight could be set to two with the Intel XScale®...
Ratio = Core Frequency : System Bus Frequency : Memory Bus Frequency Proper DMA controller usage can reduce the workload of the processor by allowing the Intel XScale® core to use the DMA controller to perform peripheral I/O. The DMA can also be used to populate the internal memory from the capture interface or external memory, etc.
Wireless MMX™ Technology Optimization Introduction This section outlines optimizations specific to ARM* architecture and also to the Intel® Wireless MMX™ Technology. These optimizations are modified for the Intel XScale® Microarchitecture where needed. This chapter focuses mainly on the assembly code level optimization.
The Intel® PXA27x Processor Family (PXA27x processor) adds a branch target buffer (BTB), which helps mitigate the penalty due to branch misprediction. However, the BTB must be enabled.
Page 51
This code takes three cycles to execute the else statement and four cycles for the if statement, assuming best case conditions and no branch misprediction penalties. In the case of the Intel XScale® Microarchitecture, a branch misprediction incurs a penalty of four cycles. If the branch is mispredicted 50% of the time and if both the if statement and the else statement are equally likely to be taken, on average the code above takes 5.5 cycles to execute.
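The 5.5-cycle average quoted above can be reproduced with a short calculation. The following is a minimal sketch only; the function name and parameter names are hypothetical and do not come from the guide.

```c
/*
 * Expected cycle count for a branch-based if/else sequence.
 * All names here are illustrative assumptions, not part of the guide.
 *   if_cycles, else_cycles : best-case cycles for each path
 *   p_if                   : probability (0..1) that the if path runs
 *   p_mispredict           : probability (0..1) of a branch misprediction
 *   penalty                : misprediction penalty in cycles
 */
double expected_branch_cycles(double if_cycles, double else_cycles,
                              double p_if, double p_mispredict,
                              double penalty)
{
    /* Average cost of the two paths, weighted by how often each runs. */
    double path_cycles = p_if * if_cycles + (1.0 - p_if) * else_cycles;

    /* Add the average misprediction cost on top. */
    return path_cycles + p_mispredict * penalty;
}
```

With the figures from the text (four cycles for the if path, three for the else path, both paths equally likely, 50% misprediction, four-cycle penalty), the weighted path cost is 3.5 cycles and the average penalty is 2 cycles, which gives 5.5.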
Page 52
P1 Percentage of times the if_stmt is likely to be executed
P2 Percentage of times likely to incur a branch misprediction penalty
Number of cycles to execute the if-else portion using conditional instructions assuming...
4.2.3.1 Bit Field Manipulation The Intel XScale® Microarchitecture shift and logical operations provide a useful way of manipulating bit fields. Bit field operations can be optimized as:
;Set the bit number specified by r1 in register r0...
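For readers working in C, the same set and clear operations can be written with a shift and a logical operator. This is a sketch under the assumption that the bit number is in the range 0 to 31; the function names are hypothetical, and a compiler is free to choose its own instruction sequence for them.

```c
#include <stdint.h>

/* Set bit number `bit` (0..31) in `value`: OR with a 1 shifted into place. */
uint32_t set_bit(uint32_t value, unsigned bit)
{
    return value | (UINT32_C(1) << bit);
}

/* Clear bit number `bit` (0..31) in `value`: AND with the inverted mask. */
uint32_t clear_bit(uint32_t value, unsigned bit)
{
    return value & ~(UINT32_C(1) << bit);
}
```

Each function needs only one shift and one logical operation, which is the property the section relies on.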
4.3.1.1 Scheduling Loads On the Intel XScale® Microarchitecture, an LDR instruction has a result latency of 3 cycles, assuming the data being loaded is in the data cache. If the instruction after the LDR needs to use the result of the load, then it would stall for 2 cycles. If possible, rearrange the instructions surrounding the LDR instruction to avoid this stall.
Page 57
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization In the code shown in the following example, the ADD instruction following the LDR stalls for two cycles because it uses the result of the load. r1, r2, r3 r0, [r5]...
Page 58
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization In the following code sample, the ADD and LDR instructions can be moved before the MOV instruction. This helps prevent pipeline stalls if the load hits the data cache. However, if the load is likely to miss the data cache, move the LDR instruction so it executes as early as possible—before...
The Intel XScale® Microarchitecture has four fill buffers used to fetch data from external memory when a data cache miss occurs. The Intel XScale® Microarchitecture stalls when all fill buffers are in use. This happens when more than four loads are outstanding and are being fetched from memory.
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization ldr r4, [r0], #32 bne Loop The modified code not only hides the load-to-use latencies for the cases of cache-hits, but also increases the throughput by allowing several loads to be outstanding at a time.
4.3.1.4 Scheduling Load Double and Store Double (LDRD/STRD) The Intel XScale® Microarchitecture introduces two new double word instructions: LDRD and STRD. LDRD loads 64 bits of data from an effective address into two consecutive registers. STRD stores 64 bits from two consecutive registers to an effective address. There are two important restrictions on how these instructions are used: •...
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization 4.3.1.5 Scheduling Load and Store Multiple (LDM/STM) LDM and STM instructions have an issue latency of 2 to 20 cycles depending on the number of registers being loaded or stored. The issue latency is typically two cycles plus an additional cycle for each of the registers loaded or stored assuming a data cache hit.
4.3.1.6 Scheduling Data-Processing Most Intel XScale® Microarchitecture data-processing instructions have a result latency of one cycle. This means that the current instruction can use the result from the previous data processing instruction without stalling. However, the result latency is two cycles if the current instruction uses the result of the previous data processing instruction for a shift by immediate.
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization Due to result latency, this code segment incurs a stall of 1-3 cycles depending on the values in registers R1 and R2: r0, r1, r2 r4, r0 A multiply instruction that sets the condition codes blocks the whole pipeline. A four-cycle multiply operation that sets the condition codes behaves the same as a four-cycle issue operation.
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization r2, [r1] r3, [r0] 4.3.1.9 Scheduling the MRA and MAR Instructions (MRRC/MCRR) The MRA (MRRC) instruction has an issue latency of one cycle, a result latency of two or three cycles depending on the destination register value being accessed, and a resource latency of two cycles.
4.3.2 Instruction Scheduling for Intel® Wireless MMX™ Technology The Intel® Wireless MMX™ Technology provides an instruction set which offers the same functionality as the Intel® MMX™ Technology and Streaming SIMD Extensions (SSE) integer instructions. 4.3.2.1 Increasing Load Throughput on Intel® Wireless MMX™ Technology The constraints on issuing load transactions with Intel XScale®...
In the above example, the WLDRD and WALIGNI instructions do not incur a stall since they are utilizing the memory and execution pipelines respectively and there are no data dependencies. When utilizing both Intel XScale® Microarchitecture and Intel® Wireless MMX™ Technology execution resources, it is also possible to overlap the multicycle instructions. The ADD instruction in the following example executes with no stalls.
• Compute-intensive processing In the following sections we illustrate how the rules for writing fast sequences of Intel® MMX™ Technology instructions on Intel® Wireless MMX™ Technology can be applied to the optimization of short loops of Intel® MMX™ Technology code.
Page 70
$$ y(i) = \sum_{j=0}^{T-1} a(j) \cdot x(i-j), \quad 0 \le i \le N-1 $$
or, in C-code,
for (i = 0; i < N; i++) {
    s = 0;
    for (j = 0; j < T; j++) {
        s += a[j]*x[i-j];...
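A self-contained scalar version of the filter loop above is useful as a reference when checking an optimized implementation. The sketch below makes assumptions the excerpt does not state: the function name is hypothetical, the data type is plain int, and input samples before x[0] are treated as zero.

```c
#include <stddef.h>

/*
 * Reference FIR filter: y[i] = sum over j of a[j] * x[i - j].
 * Assumption (not stated in the guide): samples before x[0] are treated
 * as zero, so the inner loop stops once j would exceed i.
 */
void fir_reference(const int *a, size_t T, const int *x, int *y, size_t N)
{
    for (size_t i = 0; i < N; i++) {
        int s = 0;
        for (size_t j = 0; j < T && j <= i; j++) {
            s += a[j] * x[i - j];
        }
        y[i] = s;
    }
}
```

Comparing the output of this loop against an unrolled or SIMD version on the same input is a simple way to catch indexing mistakes introduced during optimization.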
WMACS wR2, wR1, wR0
SUBS r3, r3, #4
BNE Loop_Begin
The parallelism of the filter may be exposed further by unrolling the loop to provide for eight taps per iteration. In the following code sequence, the loop has been unrolled once allowing several load-to-use stalls to be eliminated.
Page 72
The output samples y(n), y(n+1), y(n+2), and y(n+3) are assigned to four 64-bit Intel® Wireless MMX™ Technology registers. In order to obtain near ideal throughput, the inner loop is unrolled to provide for eight taps for each of the four output samples per loop iteration.
The multi-sample technique may be applied whenever the same data is being utilized for multiple calculations. The large register file on Intel® Wireless MMX™ Technology facilitates this approach and a number of variations are possible.
Porting Existing Intel® MMX™ Technology Code to Intel® Wireless MMX™ Technology The re-use of existing Intel® MMX™ Technology code is encouraged since algorithm mapping to Intel® Wireless MMX™ Technology may be significantly accelerated. The Intel® MMX™ Technology target pipeline and architecture differ from those of Intel® Wireless MMX™ Technology, and several changes are required for optimal mapping.
• The Intel® Wireless MMX™ Technology instructions provide encoding for three registers unlike the Intel® MMX™ Technology instructions which provide for two registers only. The destination registers may be different from the source registers when converting Intel® MMX™ Technology code to Intel® Wireless MMX™ Technology. Remove all code sequences in Intel®...
— WSHUFH PSHUFW Following is a set of examples showing subtle differences between Intel® MMX™ Technology and Intel® Wireless MMX™ Technology code. The number of cases has been limited for the sake of brevity. 4.5.2 Unsigned Unpack Example The Intel® Wireless MMX™ Technology provides instructions for unpacking 8-bit, 16-bit, or 32-bit data and either sign-extending or zero-extending.
Optimizing Libraries for System Performance Many of the standard C library routines can benefit greatly by being optimized for the Intel XScale® Microarchitecture. The following string and memory manipulation routines are good candidates to be tuned for the Intel XScale® Microarchitecture.
Using preloads appropriately, the code can be desensitized to the memory latency (preloads and prefetches are the same). Preloads are described further in Section 5.1.1.1.2, “Preload Loop Scheduling” on page 5-2. The following code performs memcpy with optimizations for latency desensitization.
Case Study 3: Dot Product Dot product is a typical vector operation for signal processing applications and graphics. For example, vertex transformation uses a graphic dot product. Using Intel® Wireless MMX™ Technology features can help accelerate these applications. The following code demonstrates how to attain this acceleration.
• Schedule around the load-to-use-latency
This example code is for the dot product:
; r0 points to source vector 1
; r1 points to source vector 2
WLDRD wR0, [r0], #8...
Bi-linear interpolation is a typical operation in image and video processing applications. For example the video decode motion compensation uses the 1/2X interpolation operation. Using Intel® Wireless MMX™ Technology features can help to accelerate these key applications. The following code demonstrates how to attain this acceleration. These items are key issues for optimizing the 1/2X motion compensation: •...
• Incorporate a library with optimizations present. For the last item, a listing of fully optimized code, the Intel® Integrated Performance Primitives (IPP) is available. The IPP comprises a rich and powerful set of general and multimedia signal processing kernels optimized for high performance on the PXA27x processor. Besides...
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization The IPP include optimized general signal and image processing primitives, as well as primitives for use in constructing internationally standardized audio, video, image, and speech encoder/decoders (CODECs) for the PXA27x processor.
Page 84
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization • Issue Latency The cycle distance from the first issue clock of the current instruction to the issue clock of the next instruction. Cache-misses, resource-dependency stalls, and resource availability conflicts can influence the actual number of cycles.
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization Table 4-2 shows how to calculate issue latency and result latency for each instruction. The UMLAL instruction (shown in the issue column) starts to issue on cycle 0 and the next instruction, ADD, issues on cycle 2, so the issue latency for UMLAL is two.
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization 4.8.3 Data Processing Instruction Timings Table 4-5. Data Processing Instruction Timings <shifter operand> is a Shift/Rotate by <shifter operand> Is Not a Shift/Rotate Register Or by Register <shifter operand> is RRX...
Minimum Issue Latency Minimum Result Latency † † MRC to R15 is unpredictable / MRC and MCR to CP0 and CP1 is described in the Intel® Wireless MMX™ Technology section Table 4-15. CP14 Register Access Instruction Timings Instruction Minimum Issue Latency...
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization 4.8.11 Thumb* Instructions In general, the timing of THUMB* instructions is the same as their equivalent ARM* instructions, except for these cases: • If the equivalent ARM* instruction maps to an entry in Table 4-3, the “Minimum Issue...
Page 92
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization Table 4-18. Issue Cycle and Result Latency of the PXA27x processor Instructions (Sheet 2 of Instructions Issue Cycle Result Latency WMAC TMIA TMIAPH TMIAxy WSLL WSRA WSRL WROR WPACK WUNPCKEH WUNPCKEL...
• If the Intel XScale® Microarchitecture MAC unit is in use, the resulting latency of a TMRC, TMRRC, and TEXRM increases accordingly. 4.10.2 Resource Hazard A resource hazard is caused when an instruction requires a resource that is already in use.
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization Figure 4-1 shows a high-level representation of the operation of the PXA27x processor coprocessor. After the register file, there are four concurrent pipelines to which an instruction can be dispatched. An instruction can be issued to a pipeline if the resource is available and there are no unresolved data dependencies.
4.10.2.4 Coprocessor Interface Pipeline The coprocessor interface pipeline also contains buffering to allow multiple outstanding MRC/MRRC operations. The coprocessor interface pipeline can continue to accept MRC and MRRC instructions every cycle until its buffers are full. Currently there is sufficient storage in the buffer for either four MRC data values (32-bit) or two MRRC data values (64-bit). Table 4-23...
XScale® Microarchitecture preload can be used to reduce register pressure instead of increasing it. The Intel XScale® Microarchitecture preload is a hint instruction and does not guarantee that the data is loaded. Whenever the load would cause a fault or a table walk, the processor ignores the preload instruction, along with the fault or table walk, and continues processing the next instruction.
Page 100
This makes it easy to predict where to fetch the data. The number of iterations to preload ahead is referred to as the preload scheduling distance (PSD). For the Intel XScale® Microarchitecture this can be calculated as: ×...
5.1.1.2.3 Preload Limitations: Bandwidth Consumption Overuse of preloads can usurp resources and degrade performance. This happens because once the bus traffic requests exceed the system resource capacity, the processor stalls. Intel XScale® Microarchitecture data transfer resources are: • 4 fill buffers •...
+= data[i];
The last two iterations preload superfluous data. The problem can be avoided by unrolling the end of the loop.
for(i=0; i<NMAX-2; i++) {
    prefetch(data[i+2]);
    sum += data[i];
}
sum += data[NMAX-2];
sum += data[NMAX-1];
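The loop above can be written as a complete function for experimentation. This sketch assumes a GCC or Clang toolchain and uses the compiler builtin `__builtin_prefetch` as a stand-in for the prefetch() call in the text; the function name is hypothetical, and whether the compiler emits a preload instruction for the builtin depends on the target and options.

```c
#include <stddef.h>

/*
 * Sum an array while preloading two elements ahead.
 * __builtin_prefetch is a GCC/Clang builtin used here as a stand-in for
 * the prefetch() call in the text (an assumption, not the guide's API).
 */
long sum_with_preload(const int *data, size_t n)
{
    long sum = 0;
    size_t i = 0;

    /* Main loop: every preload address stays inside the array. */
    if (n > 2) {
        for (; i < n - 2; i++) {
            __builtin_prefetch(&data[i + 2]);
            sum += data[i];
        }
    }

    /* Tail: the last (up to) two elements need no further preloads. */
    for (; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

The separate tail loop plays the role of the two unrolled statements in the text and also covers arrays shorter than the preload distance.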
Page 103
The variables are preloaded in the opposite order from that in which they are used. If there is a cache conflict and data is evicted from the cache, then only the data from the first preload is lost.
Stride (the way data structures are walked through) can affect the temporal quality of the data and reduce or increase cache conflicts. The Intel XScale® Microarchitecture data cache and mini-data cache each have 32 sets of 32 bytes. This means that each cache line in a set is on a modular 1-Kbyte address boundary.
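The set geometry above implies that two addresses compete for the same set whenever they differ by a multiple of 1 Kbyte (32 sets of 32 bytes). The following sketch shows that mapping; the macro and function names are hypothetical, and it assumes the set index is taken directly from address bits [9:5].

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* line size stated in the text */
#define CACHE_NUM_SETS   32u  /* number of sets stated in the text */

/* Set index selected by an address: bits [9:5] under the geometry above. */
unsigned cache_set_index(uint32_t addr)
{
    return (addr / CACHE_LINE_BYTES) % CACHE_NUM_SETS;
}

/* Nonzero when two addresses compete for the same set. */
int same_cache_set(uint32_t a, uint32_t b)
{
    return cache_set_index(a) == cache_set_index(b);
}
```

A helper like this can be used to check whether a chosen array stride repeatedly lands on the same set, which is the conflict pattern the section warns about.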
Page 105
1 by rearranging the fields in the above data structure as shown:
struct employee {
    struct employee *prev;
    struct employee *next;
    int ssno;
    int empid;
    float Year2DatePay;
    float Year2DateTax;
    float Year2Date401KDed;
    float Year2DateOtherDed;
Spatially dispersing the data comprising one data set (for example, an array or structure) throughout a memory range instead of keeping the data in contiguous memory locations.
5.1.6 Loop Unrolling Most compilers unroll fixed length loops when compiled with speed optimizations.
Page 108
= 4;
int nTotalBlockIters;
int i;
// find the largest multiple of nItersPerBlock that is less than or equal to nTotalIterations
nTotalBlockIters = (nTotalIterations / nItersPerBlock) *
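The fragment above is cut off, so the following is only a sketch of how a block-unrolled loop with a remainder loop is commonly completed. The variable names nItersPerBlock, nTotalBlockIters, and nTotalIterations follow the fragment; the function name, the block size of four, and the summing loop body are assumptions made for illustration.

```c
/*
 * Sum `nTotalIterations` values, unrolled four at a time.
 * The function name and loop body are illustrative assumptions; only the
 * variable names are taken from the fragment in the text.
 */
int sum_unrolled(const int *v, int nTotalIterations)
{
    const int nItersPerBlock = 4;
    int nTotalBlockIters;
    int i;
    int sum = 0;

    /* Largest multiple of nItersPerBlock that is <= nTotalIterations. */
    nTotalBlockIters = (nTotalIterations / nItersPerBlock) * nItersPerBlock;

    /* Unrolled body: four iterations per pass through the loop. */
    for (i = 0; i < nTotalBlockIters; i += nItersPerBlock) {
        sum += v[i];
        sum += v[i + 1];
        sum += v[i + 2];
        sum += v[i + 3];
    }

    /* Handle the leftover iterations one at a time. */
    for (; i < nTotalIterations; i++) {
        sum += v[i];
    }
    return sum;
}
```

The remainder loop keeps the function correct when the iteration count is not a multiple of the block size.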
For example, here is a typical for() loop.
for (i=0; i<1000; ++i) {p1();}
This code provides the same behavior without as much loop overhead.
for (i=1000; i>0; --i) {p1();}
Packed data formats can also be processed using the Intel® Wireless MMX™ Technology. The Intel XScale® Microarchitecture performs best on word-size data aligned on a 4-byte boundary. Intel® Wireless MMX™ Technology requires data to be aligned on an 8-byte boundary.
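One way to meet the 8-byte alignment requirement in C is to request it at the declaration and, where pointers come from elsewhere, to check it before use. This sketch assumes a C11 compiler; the buffer name, its size, and the helper function are hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>

/* Buffer forced onto an 8-byte boundary with C11 alignas. */
static alignas(8) int16_t samples[64];

/*
 * Returns nonzero when `p` sits on a `boundary`-byte boundary.
 * `boundary` must be a power of two.
 */
int is_aligned(const void *p, uintptr_t boundary)
{
    return ((uintptr_t)p & (boundary - 1u)) == 0;
}
```

A check such as is_aligned(ptr, 8) lets a routine fall back to a slower path when a caller passes data that does not meet the alignment requirement.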
Page 111
In this case, the preload address was advanced by the size of half a cache line and every other preload instruction is ignored. Further, an additional register is required to track the next preload address. Generally, not aligning and sizing data adds extra computational overhead.
5.1.13 Placing Literal Pools The Intel XScale® Microarchitecture does not have a single instruction that can load a literal (a constant or address) to a register; all literals require multiple instructions to load. One technique to load registers with literals in the Intel XScale® Microarchitecture is by loading the literal from a memory location that has been initialized with the constant or address.
Page 113
Passing by pointer or reference is highly preferred over passing by value. Passing by value should only be used when there is a compelling reason to do so. Small data types (4 bytes or less in size) are the exception.
The major topics covered in this section include considerations for reducing the power consumption of the Intel XScale® core and memory. The power savings and performance tradeoffs vary depending on the user’s system configuration.
All power domains except VCC_RTC, and VCC_OSC are placed in a low-power mode where state is retained but no activity is allowed, some of the internal power domains (see the Intel® PXA270 Processor Electrical, Mechanical, and Thermal Specification and the Intel® PXA27x Processor Family Electrical, Mechanical, and Thermal Specification ) can be powered off, and both PLLs are disabled;...
The power savings realized through the use of Wireless Intel Speedstep® Technology Power Manager can be substantial and are an important part of the Wireless Intel Speedstep® Technology. Applications require some additional considerations and changes in order to take advantage of the power manager, but these changes are minimal.
208 MHz. The system bus runs at 52 MHz and the memory bus runs at 104 MHz. In all cases, the Intel XScale® core runs at 208 MHz. The four cases trade the speed of the system bus and memory bus against power consumption. The lower the frequency of either bus, the lower the core power consumption.
This minimizes the power usage of the external memory bus, which is a major component of total system power. Refer to the Programmable Output Buffer Strength registers in the Intel® PXA27x Processor Family Developer’s Manual for more information.
6.3.7.1 Normal Mode
It may require less power to run at a higher Intel XScale® core frequency/voltage to complete a task and then drop the operating frequency and Intel XScale® core voltage than to run at a constant frequency and voltage. Profile the OS and applications to determine the optimum Intel XScale®...
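The first claim in the Normal Mode guidance can be checked with simple energy arithmetic: energy is power multiplied by time, summed over the active and idle parts of a period. The sketch below is an illustration only. The `run_profile` structure and `period_energy_uj` function are hypothetical, and any power figures fed to it must come from profiling the actual system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one way to run a task within a period. */
typedef struct {
    uint32_t active_mw; /* power while the task runs, in milliwatts */
    uint32_t active_ms; /* time the task takes, in milliseconds */
    uint32_t idle_mw;   /* power after dropping frequency and voltage */
} run_profile;

/* Total energy over period_ms in microjoules (mW * ms = uJ). */
uint64_t period_energy_uj(const run_profile *p, uint32_t period_ms)
{
    assert(p != NULL);
    assert(p->active_ms <= period_ms);

    uint64_t active = (uint64_t)p->active_mw * p->active_ms;
    uint64_t idle = (uint64_t)p->idle_mw * (period_ms - p->active_ms);
    return active + idle;
}
```

With purely made-up numbers (not measurements from the guide), a fast profile of 400 mW for 20 ms and a slow profile of 150 mW for 60 ms, both idling at 10 mW over a 100 ms period, give 8800 µJ and 9400 µJ respectively, so the faster run uses less energy in that hypothetical case. With other numbers the slower run can win, which is why profiling is needed.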
6.3.7.3 Deep-Idle Mode
Use deep-idle mode instead of idle mode whenever the period without Intel XScale® core operation is long enough to accomplish the necessary voltage and frequency changes.
6.3.7.4 Standby Mode
For lowest power consumption in standby mode: •...
• Use cache policies to optimize throughput and performance
• Use the internal SRAM
• Park the system bus arbiter on the Intel XScale® core unless performing a task that heavily uses a different system bus client
• Make the LCD frame buffer non-cached but bufferable
•...
• Use half-turbo mode to reduce the core clock frequency without impacting bus operations
• Use the minimum Memory Buffer register strength settings possible
• Use the minimum LCD Buffer Strength register settings possible
• Use the Intel® Quick Capture Interface to bring in image data in YCbCr mode when possible
Page 125
(B/s). The size of a network “pipe” or channel for communications in wired networks. In wireless, it refers to the range of available frequencies that carry a signal.
Base Station: The telephone company’s interface to the Mobile Station.
BGA: Ball Grid Array.
Page 126
Several versions of the standard are still under development. CDMA should increase network capacity for wireless carriers and improve the quality of wireless messaging. CDMA is an alternative to GSM.
Page 127
The default address is 00H.
Default Pipe: The message pipe created by the USB System Software to pass control and status information between the host and a USB device’s endpoint zero.
Page 128
GSM networks.
EEPROM: See Electrically Erasable Programmable Read Only Memory.
Electrically Erasable Programmable Read Only Memory (EEPROM): Non-volatile re-writable memory storage technology.
End User: The user of a host.
Page 129
Fs: See sample rate.
FSR: Fault Status Register, part of the ARM* architecture.
Full-duplex: Computer data transmission occurring in both directions simultaneously.
Page 130
Hub: A USB device that provides additional connections to the USB.
Hub Tier: One plus the number of USB links in a communication path between the host and a function.
IMMU: Instruction Memory Management Unit, part of the Intel XScale® core.
Page 131
Transmission rate expressed in kilobits per second. A measurement of bandwidth in the U.S.
kB/s: Transmission rate expressed in kilobytes per second.
Page 132
MMC: Multimedia Card, a small form factor memory and I/O card.
MMX Technology: The Intel® MMX™ technology comprises a set of instructions that are designed to greatly enhance the performance of advanced media and communications applications. See chapter 10 of the Intel Architecture Software Developers Manual, Volume 3: System Programming Guide, Order #245472.
Page 133
Polling: Asking multiple devices, one at a time, if they have any data to transmit.
POR: See Power On Reset.
Port: Point of access to or from a system or circuit. For the USB, the point where a USB device is attached.
Page 134
RTC: Real-Time Clock.
SA-1110: StrongARM based applications processor for handheld products.
Intel® StrongARM* SA-1111: Companion chip for the Intel® SA-1110 processor.
SAD: Sum of absolute differences.
Sample: The smallest unit of data on which an endpoint operates; a property of an endpoint.
Page 135
Spread Spectrum: An encoding technique patented by actress Hedy Lamarr and composer George Antheil, which broadcasts a signal over a range of frequencies.
SRAM: Static Random Access Memory.
SRC: See Sample Rate Conversion.
SSE: Streaming SIMD Extensions.
Page 136
SSE2: Streaming SIMD Extensions 2. For Intel Architecture machines, 144 new instructions, including 128-bit SIMD integer arithmetic and 128-bit SIMD double-precision floating-point instructions, enabling enhanced multimedia experiences.
SSP: Synchronous Serial Port.
SSTL: Stub series terminated logic.
Stage: One part of the sequence composing a control transfer; stages include the Setup stage, the Data stage, and the Status stage.
Page 137
WAP Wireless Application Protocol. WAP is a set of protocols that lets users of mobile phones and other digital wireless devices access Internet content, check voice mail and e-mail, receive text of faxes and conduct transactions. WAP works with multiple standards, including CDMA and GSM. Not all mobile devices support WAP.
Page 138
A wireless LAN can serve as a replacement for, or an extension to, a traditional wired LAN.
Wireless MMX: Intel® Wireless MMX™ technology integrates the high performance of Intel® MMX ®...
Page 139
Branch Instruction Timings (Those Predicted By the BTB (Branch Target Buffer))
Buffer for Capture Interface
Data Cache and Buffer Behavior when X = 1
Data Cache and Buffer operation comparison for Intel® SA-1110 and Intel XScale® Microarchitecture, X=0...