Low Number Of Iterations; Bandwidth Limitations - Intel XScale Core Developer's Manual

Intel XScale® Core Developer's Manual
Optimization Guide
A.4.4.5. Low Number of Iterations

In loops with very low iteration counts, the advantages of prefetch may be completely negated. A
loop with a small, fixed number of iterations may run faster if it is completely unrolled than if
prefetch instructions are scheduled into it.
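
As a minimal sketch of the unrolling described above, consider a hypothetical four-element dot
product. The function name dot4 and the element count are assumptions chosen for illustration and
do not come from the manual. Because the trip count is small and fixed, the loop is written out in
full and no prefetch is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: dot product of two 4-element vectors.
 * The trip count is fixed at 4, so the loop is fully unrolled and no
 * prefetch is scheduled; the loop counter and branch disappear as well. */
int32_t dot4(const int32_t *a, const int32_t *b)
{
    assert(a != NULL && b != NULL);
    return a[0] * b[0]
         + a[1] * b[1]
         + a[2] * b[2]
         + a[3] * b[3];
}
```

Whether this form is actually faster than a prefetching loop depends on the data layout and should
be confirmed by measurement on the target.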

A.4.4.6. Bandwidth Limitations

Overuse of prefetches can consume shared resources and degrade performance: once bus traffic
requests exceed the system resource capacity, the processor stalls. The core data transfer
resources are:
• 4 fill buffers
• 4 pending buffers
• 8 half-cache-line write buffers

SDRAM resources are typically:
• 4 memory banks
• 1 page buffer per bank, each referencing a 4K address range
• 4 transfer request buffers

Consider how these resources work together. A fill buffer is allocated for each cache read miss. A
fill buffer, along with a pending buffer, is also allocated for each cache write miss if the memory
space is write-allocate. A subsequent read to the same cache line does not require a new fill
buffer but does require a pending buffer, and a subsequent write also requires a new pending
buffer. A fill buffer is also allocated for each read from non-cached memory, and a write buffer is
needed for each write to non-cached memory that is non-coalescing. Consequently, an STM
instruction listing eight registers and referencing non-cached memory uses eight write buffers if
the writes do not coalesce and two write buffers if they do. A cache eviction requires a write
buffer for each dirty bit set in the cache line. The prefetch instruction requires a fill buffer
for each cache line and 0, 1, or 2 write buffers for an eviction.
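
The STM arithmetic above can be checked with a small model. This is a sketch under stated
assumptions, not a description of the hardware logic: it assumes a 32-byte cache line (so each
half-cache-line write buffer holds 16 bytes), 4-byte registers, and a store address aligned to a
write buffer boundary. The function name stm_write_buffers is hypothetical.

```c
#include <assert.h>

/* Assumptions for this model (not stated in the text above):
 * 32-byte cache line, so a half-cache-line write buffer holds 16 bytes;
 * each STM register is 4 bytes; the store starts on a 16-byte boundary. */
enum { WRITE_BUFFER_BYTES = 16, REGISTER_BYTES = 4 };

/* Estimate the write buffers consumed by one STM to non-cached memory.
 * coalescing == 0: each register write occupies its own buffer.
 * coalescing != 0: adjacent writes merge, so buffers are filled 16 bytes
 *                  at a time (rounded up). */
unsigned stm_write_buffers(unsigned num_regs, int coalescing)
{
    assert(num_regs >= 1 && num_regs <= 16);
    if (!coalescing) {
        return num_regs;
    }
    unsigned bytes = num_regs * REGISTER_BYTES;
    return (bytes + WRITE_BUFFER_BYTES - 1) / WRITE_BUFFER_BYTES;
}
```

Under these assumptions, an eight-register STM maps to eight buffers without coalescing, which is
the full set of write buffers listed above, and to two buffers with coalescing.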
When adding prefetch instructions, take care to ensure that the combined prefetch and instruction
bus requests do not exceed the system resource capacity described above; otherwise performance
will degrade rather than improve. The key points are to spread prefetch operations across
calculations so that bus traffic can flow freely, and to minimize the number of prefetches
needed.
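
One way to apply both points is to issue a single prefetch per cache line and place the
computation on the current line between successive prefetches. The sketch below assumes a 32-byte
cache line, a prefetch distance of one line, and a GCC-compatible compiler that provides
__builtin_prefetch (which typically lowers to a prefetch instruction where the target has one).
The function name sum_with_prefetch and the constant INTS_PER_LINE are hypothetical, and the best
prefetch distance depends on memory latency and should be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache line, i.e. eight 32-bit integers per line. */
#define INTS_PER_LINE 8

/* Sum an array while prefetching one cache line ahead.
 * Only one prefetch is issued per cache line (not one per element), and
 * eight additions separate consecutive prefetches, which spreads the bus
 * requests over the computation. */
int64_t sum_with_prefetch(const int32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    int64_t total = 0;
    size_t i = 0;

    /* Process whole cache lines. */
    while (i + INTS_PER_LINE <= n) {
#if defined(__GNUC__)
        /* Hint only: request the next line; the address is never dereferenced here. */
        __builtin_prefetch(&data[i + INTS_PER_LINE]);
#endif
        for (size_t j = 0; j < INTS_PER_LINE; ++j) {
            total += data[i + j];
        }
        i += INTS_PER_LINE;
    }

    /* Remaining elements that do not fill a whole line. */
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

With one outstanding prefetch per iteration of the outer loop, the loop itself holds at most one
fill buffer for prefetching at a time, leaving the others free for demand misses.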
January, 2004
