Cache Blocking; Loop Interchange - Intel PXA270 Optimization Manual


High Level Language Optimization
5.1.3 Cache Blocking

Cache blocking techniques, such as strip-mining(1), are used to improve the temporal locality of the
data. Given a large data set that can be reused across multiple passes of a loop, data blocking
divides the data into smaller chunks. Each chunk is loaded into the cache during the first pass and
is then available for processing on subsequent passes, which minimizes cache misses and reduces
bus traffic.
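The chunking idea can be sketched on a one-dimensional array. This is a minimal illustration, not code from the guide: the function name `scale_then_sum_blocked` and the chunk size `CHUNK` are assumptions, and a suitable chunk size depends on the data cache of the target.

```c
#include <stddef.h>

/* Assumed chunk size in elements; tune it so one chunk fits in the data cache. */
#define CHUNK 1024

/*
 * Hypothetical two-pass workload: pass 1 scales every element, pass 2 sums
 * the scaled values. Both passes run on one chunk before moving to the next,
 * so pass 2 touches data that pass 1 has just brought into the cache.
 */
long scale_then_sum_blocked(int *data, size_t n, int factor)
{
    long total = 0;

    for (size_t start = 0; start < n; start += CHUNK) {
        size_t end = (start + CHUNK < n) ? start + CHUNK : n;

        for (size_t i = start; i < end; i++)   /* pass 1: loads the chunk */
            data[i] *= factor;

        for (size_t i = start; i < end; i++)   /* pass 2: reuses the chunk */
            total += data[i];
    }
    return total;
}
```

Running each pass over the whole array instead would give the same result, but with a data set larger than the cache the second pass would have to reload every element.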
As an example of cache blocking, consider this code:
for(i=0; i<10000; i++)
  for(j=0; j<10000; j++)
    for(k=0; k<10000; k++)
      C[j][k] += A[i][k] * B[j][i];
The variable A[i][k] is completely reused. However, accessing C[j][k] in the j and k loops can
displace A[i][k] from the cache. Using cache blocking, the code becomes:
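for(i=0; i<10000; i++)
  for(j1=0; j1<100; j1++)
    for(k1=0; k1<100; k1++)
      for(j2=0; j2<100; j2++)
        for(k2=0; k2<100; k2++)
        {
          j = j1 * 100 + j2;
          k = k1 * 100 + k2;
          C[j][k] += A[i][k] * B[j][i];
        }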
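The blocked loop nest can be checked against the unblocked one at a small scale. The following is a sketch under stated assumptions: the sizes `N` and `BLOCK`, the global arrays, and the helper `kernels_agree` are hypothetical and chosen only so the check runs quickly. It verifies that both nests compute the same result; it does not measure cache behavior.

```c
#include <string.h>

#define N     64   /* assumed problem size, far smaller than the 10000 in the text */
#define BLOCK 16   /* assumed tile edge; N must be a multiple of BLOCK */

static int A[N][N], B[N][N];
static int C_ref[N][N], C_blk[N][N];

/* Unblocked reference nest: C[j][k] += A[i][k] * B[j][i] */
void kernel_plain(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C_ref[j][k] += A[i][k] * B[j][i];
}

/* Blocked nest: j and k are split into a tile index and an offset in the tile. */
void kernel_blocked(void)
{
    for (int i = 0; i < N; i++)
        for (int j1 = 0; j1 < N / BLOCK; j1++)
            for (int k1 = 0; k1 < N / BLOCK; k1++)
                for (int j2 = 0; j2 < BLOCK; j2++)
                    for (int k2 = 0; k2 < BLOCK; k2++) {
                        int j = j1 * BLOCK + j2;
                        int k = k1 * BLOCK + k2;
                        C_blk[j][k] += A[i][k] * B[j][i];
                    }
}

/* Fills the inputs, runs both nests, and returns 1 if the outputs match. */
int kernels_agree(void)
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            A[r][c] = r + c;
            B[r][c] = r - c;
        }
    memset(C_ref, 0, sizeof C_ref);
    memset(C_blk, 0, sizeof C_blk);
    kernel_plain();
    kernel_blocked();
    return memcmp(C_ref, C_blk, sizeof C_ref) == 0;
}
```

Each (j, k) pair is still visited exactly once per value of i, so only the visiting order changes, not the arithmetic.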
5.1.4 Loop Interchange

As previously mentioned, the sequence in which data is accessed affects cache thrashing. Usually,
it is best to access data in a spatially contiguous address range. However, arrays of data may have
been laid out such that consecutively indexed elements are not physically next to each other.
Consider the following C code, in which array elements are placed in row-major order.
for(j=0; j<NMAX; j++)
  for(i=0; i<NMAX; i++)
  {
    prefetch(A[i+1][j]);
    sum += A[i][j];
  }
In the above example, A[i][j] and A[i+1][j] are not adjacent in memory; they lie one full row apart.
This increases bus traffic when preloading loop data. In some cases where the loop mathematics are
unaffected, the problem can be resolved by induction variable interchange. The above example
becomes:
for(i=0; i<NMAX; i++)
  for(j=0; j<NMAX; j++)
  {
    prefetch(A[i][j+1]);
    sum += A[i][j];
  }
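The interchange can be written out as two complete functions and compared. This is a minimal sketch, not code from the guide: `NMAX` is given an assumed small value, and `PREFETCH` is a hypothetical stand-in for the `prefetch` call in the text, mapped here to the GCC/Clang `__builtin_prefetch` builtin when available and to a no-op otherwise.

```c
#define NMAX 128   /* assumed array dimension for illustration */

/* Hypothetical prefetch hint standing in for prefetch() in the text. */
#if defined(__GNUC__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)
#endif

static int A[NMAX][NMAX];

/* Gives every element a distinct value so the sums are easy to check. */
void fill_matrix(void)
{
    for (int i = 0; i < NMAX; i++)
        for (int j = 0; j < NMAX; j++)
            A[i][j] = i * NMAX + j;
}

/* Original order: the inner loop steps i, so each access is one row apart. */
long sum_column_order(void)
{
    long sum = 0;
    for (int j = 0; j < NMAX; j++)
        for (int i = 0; i < NMAX; i++)
            sum += A[i][j];
    return sum;
}

/* Interchanged order: the inner loop steps j, so accesses are contiguous. */
long sum_row_order(void)
{
    long sum = 0;
    for (int i = 0; i < NMAX; i++)
        for (int j = 0; j < NMAX; j++) {
            if (j + 1 < NMAX)
                PREFETCH(&A[i][j + 1]);   /* next element in the same row */
            sum += A[i][j];
        }
    return sum;
}
```

Because addition over the same set of elements is order-independent here, both functions return the same sum; the interchange changes only the memory access pattern.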
1. Spatially dispersing the data comprising one data set (for example, an array or structure)
throughout a memory range instead of keeping the data in contiguous memory locations.
Intel® PXA27x Processor Family Optimization Guide


This manual is also suitable for:

Pxa271, Pxa272, Pxa273