Array Merging - Intel PXA270 Optimization Manual


High Level Language Optimization
5.1.1.3.3 Coding Technique: Preload to Reduce Register Pressure
Preloading can reduce register pressure. When data is needed for an operation, the load should be
scheduled far enough in advance to hide the load latency. However, the load ties up the receiving
register until the data can be used. For example:
ldr   r2, [r0]
; Process code {not yet cached latency > 60 core clocks}
add   r1, r1, r2
In the above case, R2 is unavailable for other processing until the add instruction. Preloading the
data frees the register for other use. The example code becomes:
pld   [r0] ;preload the data keeping r2 available for use
; Process code
ldr   r2, [r0]
; Process code {ldr result latency is 3 core clocks}
add   r1, r1, r2
With the added preload, register R2 can be used for other operations until just before it is needed.
Apart from code optimization for preload, there are many other techniques to use while writing C
and C++ code; these are discussed in later chapters.
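In C, GCC-compatible compilers expose a preload hint through the `__builtin_prefetch` builtin, which on ARM targets that support it is typically lowered to a pld instruction. The following is a minimal sketch under that assumption. The function name `sum_with_preload` and the look-ahead distance `PRELOAD_AHEAD` are hypothetical, and the distance would need tuning on real hardware.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance in elements; tune for the real workload. */
#define PRELOAD_AHEAD 8

/* Sums n ints while hinting the cache about data needed a few iterations later. */
int sum_with_preload(const int *data, size_t n)
{
    int sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* The hint has no destination register, so nothing is tied up
           while the line is fetched. Stay inside the array bounds. */
        if (i + PRELOAD_AHEAD < n)
            __builtin_prefetch(&data[i + PRELOAD_AHEAD]);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint: the function returns the same sum whether or not the compiler or the processor acts on it.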
5.1.2 Array Merging

Stride (the way data structures are walked through) can affect the temporal quality of the data and
reduce or increase cache conflicts. The Intel XScale® Microarchitecture data cache and mini-data
cache each have 32 sets with 32-byte lines. This means that the cache lines in a given set fall on
modular 1-Kbyte address boundaries. It is important to choose data structure sizes and strides that
do not overwhelm a given set, which causes conflicts and increased register pressure. Register
pressure can increase because additional registers are required to track preload addresses.
Conflicts can be avoided by rearranging data structure components to use more parallel access to
search and compare elements. Similarly, rearranging data structures so that the sections that are
often written fit in the same half cache line (see note 1) can reduce cache eviction write-backs.
On a global scale, techniques such as array merging can enhance the spatial locality of the data.
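The 1-Kbyte figure follows from 32 sets times 32 bytes per line. As a rough illustration, assume the set index is taken from the address bits just above the 5-bit line offset. That mapping is an assumption consistent with the geometry described here, not something this section states. Under it, the hypothetical helpers `cache_set_index` and `same_cache_set` below show that addresses 1 Kbyte apart land in the same set:

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* bytes per cache line */
#define CACHE_NUM_SETS   32u  /* sets in the cache */

/* Set index under the assumed mapping: drop the line offset, then wrap
   around the number of sets. */
unsigned cache_set_index(uint32_t addr)
{
    return (addr / CACHE_LINE_BYTES) % CACHE_NUM_SETS;
}

/* Nonzero when two addresses compete for the same set. */
int same_cache_set(uint32_t a, uint32_t b)
{
    return cache_set_index(a) == cache_set_index(b);
}
```

For instance, walking a structure with a 1024-byte stride would touch the same set on every access under this mapping, while a 32-byte stride would cycle through all 32 sets.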
As an example of array merging, refer to this code:
int a[NMAX];
int b[NMAX];
int ix;
int i;
for (i = 0; i < NMAX; i++)
{
    ix = b[i];
    if (a[i] != 0)
        ix = a[i];
    /* do other calculations */
}
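In the loop above, a[i] and b[i] are read together but live in separate arrays, so they are not spatially close. Below is a minimal sketch of one way to merge them, assuming an interleaved array of structures. The names `merged_t` and `pick_values` and the `out` array are illustrative and are not taken from the text above.

```c
/* a and b for the same index now sit side by side in memory. */
typedef struct {
    int a;
    int b;
} merged_t;

/* Same selection logic as the original loop, written against the merged
   layout; out[i] receives the value the original loop stored in ix. */
void pick_values(const merged_t c[], int out[], int n)
{
    int i;

    for (i = 0; i < n; i++) {
        int ix = c[i].b;
        if (c[i].a != 0)
            ix = c[i].a;
        out[i] = ix;  /* stands in for "do other calculations" */
    }
}
```

With this layout, the two fields for a given index are adjacent, so a single pass walks one contiguous region instead of two separate ones.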
1. A half cache line is 16 bytes for the Intel XScale® Microarchitecture.
Intel® PXA27x Processor Family Optimization Guide

