General Remarks On Software Pipelining; Multi-Sample Technique - Intel PXA270 Optimization Manual

Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
WMACS wR2, wR1, wR0
SUBS r3, r3, #4
BNE Loop_Begin
The parallelism of the filter may be exposed further by unrolling the loop to process eight taps
per iteration. In the following code sequence, the loop has been unrolled once, which allows several
load-to-use stalls to be eliminated. The loop overhead is also amortized further, from two cycles
for every four taps to two cycles for every eight taps. A single load-to-use stall remains between
the second WLDRD instruction and the second WMACS instruction within the inner loop.
; Pointers r0 -> val, r1 -> pResult, r2 -> pTapsQ15, r3 -> tapsLen
; Note: r4, used below as the second load pointer, is not listed above.
WLDRD wR0, [r2] , #8
WZERO wR15
WLDRD wR1, [r4] , #8
Loop_Begin:
WLDRD wR2, [r2] , #8
SUBS r3, r3, #8
WLDRD wR3, [r4] , #8
WMACS wR15, wR1, wR0
WLDRDNE wR0, [r2] , #8
WMACS wR15, wR2, wR3
WLDRDNE wR1, [r4] , #8
BNE Loop_Begin
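As a cross-check on the structure of the loop above, the following is a minimal scalar sketch in C, not the manual's code. It assumes that each WMACS step multiplies four pairs of signed 16-bit values and adds the products to a 64-bit accumulator, and that each WLDRD fetches four 16-bit values; the names `mac4` and `fir_dot_unrolled8` are hypothetical. Pointers stand in for the 64-bit data registers, so the "loads" here only model the order of operations, not their latency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed model of one WMACS step: four signed 16-bit products
   summed into a 64-bit accumulator. */
static int64_t mac4(int64_t acc, const int16_t *a, const int16_t *b)
{
    for (int lane = 0; lane < 4; ++lane)
        acc += (int32_t)a[lane] * (int32_t)b[lane];
    return acc;
}

/* Eight taps per iteration, mirroring the unrolled loop.
   taps_len must be a nonzero multiple of 8. */
int64_t fir_dot_unrolled8(const int16_t *taps, const int16_t *samples,
                          size_t taps_len)
{
    assert(taps_len != 0 && taps_len % 8 == 0);

    int64_t acc = 0;                  /* WZERO wR15                  */
    const int16_t *t0 = taps;         /* prologue load (wR0)         */
    const int16_t *s0 = samples;      /* prologue load (wR1)         */
    taps += 4;
    samples += 4;

    while (taps_len != 0) {
        const int16_t *t1 = taps;     /* second tap group (wR2)      */
        const int16_t *s1 = samples;  /* second sample group (wR3)   */
        taps += 4;
        samples += 4;
        taps_len -= 8;                /* SUBS r3, r3, #8             */

        acc = mac4(acc, s0, t0);      /* first WMACS                 */
        if (taps_len != 0) {          /* WLDRDNE: reload only if     */
            t0 = taps;                /* another iteration follows   */
            taps += 4;
        }
        acc = mac4(acc, t1, s1);      /* second WMACS                */
        if (taps_len != 0) {
            s0 = samples;
            samples += 4;
        }
    }
    return acc;
}
```

The conditional reloads correspond to the WLDRDNE instructions: on the final iteration they are skipped, so the sketch never reads past the end of either array.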
4.4.1.1 General Remarks on Software Pipelining

In the example for the real block FIR filter, two copies of the basic sequence of code were
interleaved, eliminating all but one of the stalls. The throughput for the sequence went from
9 cycles for every four taps (2.25 cycles per tap) to 9 cycles for every eight taps. This
corresponds to 1.125 cycles per tap, a 2X throughput improvement.
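The throughput figures can be checked with a few lines of C. This is only a worked-arithmetic sketch; the helper name `cycles_per_tap` is hypothetical, and the cycle counts are the ones quoted in the text.

```c
#include <assert.h>

/* Cycles spent per filter tap for a loop body that handles
   `taps` taps in `cycles` cycles. */
double cycles_per_tap(unsigned cycles, unsigned taps)
{
    assert(taps != 0);
    return (double)cycles / (double)taps;
}
```

With the quoted counts, `cycles_per_tap(9, 4)` gives 2.25 and `cycles_per_tap(9, 8)` gives 1.125, and the ratio of the two is the 2X improvement.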
It is useful to define a metric for the number of copies of a basic sequence of instructions
that must be interleaved in order to remove all stalls. We can call this the interleave factor, k.
The real block FIR filter requires k=2 to eliminate all possible stalls, primarily because it is a
short sequence that must accommodate the long load-to-use latency. In practice, k=2 is sufficient
for most loops encountered in real applications. This is fortunate because each interleaved copy
requires its own set of temporary registers, and with some algorithms interleaving with k=3 is not
possible. A good rule of thumb is to try k=2 first, as it is usually the right choice.
4.4.2 Multi-Sample Technique

The multi-sample optimization technique calculates multiple outputs with each loop iteration,
similar to loop unrolling. The disadvantages of this technique include increased code size for
critical loops and restrictions on the minimum number, and the permitted multiples, of taps or
samples. The obvious advantage is reduced cycle consumption: memory bandwidth is reduced by data
re-use, and load-to-use stalls may be easily eliminated with scheduling.
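The data re-use behind the technique can be illustrated with a scalar C sketch that produces two outputs per outer iteration. This is an illustration under stated assumptions, not the manual's implementation: the function name `fir_two_outputs` is hypothetical, the filter is written in correlation form (out[n] = sum over k of taps[k] * in[n + k]), and the outputs are left as raw 64-bit sums with no Q15 scaling or saturation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two outputs per outer iteration.
   `in` must hold n_out + n_taps - 1 samples;
   n_out must be a multiple of 2 (the restriction the text mentions). */
void fir_two_outputs(const int16_t *in, const int16_t *taps, size_t n_taps,
                     int64_t *out, size_t n_out)
{
    assert(n_out % 2 == 0);

    for (size_t n = 0; n < n_out; n += 2) {
        int64_t acc0 = 0;                 /* accumulates out[n]     */
        int64_t acc1 = 0;                 /* accumulates out[n + 1] */
        int16_t x0 = in[n];               /* sample carried between taps */

        for (size_t k = 0; k < n_taps; ++k) {
            int16_t c  = taps[k];         /* each tap loaded once, used twice */
            int16_t x1 = in[n + k + 1];   /* one new sample per tap           */
            acc0 += (int32_t)c * x0;      /* uses in[n + k]                   */
            acc1 += (int32_t)c * x1;      /* uses in[n + k + 1]               */
            x0 = x1;                      /* re-use the sample next step      */
        }
        out[n] = acc0;
        out[n + 1] = acc1;
    }
}
```

For each pair of outputs the sketch reads n_taps taps and n_taps + 1 samples, where two independent single-output loops would read 2 * n_taps of each. The two accumulators are also independent of one another, which leaves room for scheduling loads away from their uses.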
Intel® PXA27x Processor Family Optimization Guide