General Remarks On Software Pipelining; Multi-Sample Technique - Intel PXA270 Optimization Manual

Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization

WMACS wR2, wR1, wR0
SUBS r3, r3, #4
BNE Loop_Begin

The parallelism of the filter may be exposed further by unrolling the loop to provide for eight taps
per iteration. In the following code sequence, the loop has been unrolled once, allowing several
load-to-use stalls to be eliminated. The loop overhead has also been further amortized, reducing it
from two cycles for every four taps to two cycles for every eight taps. There is still a single
load-to-use stall present between the second WLDRD instruction and the second WMACS instruction
within the inner loop.

; Pointers r0 -> val, r1 -> pResult, r2 -> pTapsQ15, r3 -> tapsLen
WLDRD wR0, [r2], #8
WZERO wR15
WLDRD wR1, [r4], #8
Loop_Begin:
WLDRD wR2, [r2], #8
SUBS r3, r3, #8
WLDRD wR3, [r4], #8
WMACS wR15, wR1, wR0
WLDRDNE wR0, [r2], #8
WMACS wR15, wR2, wR3
WLDRDNE wR1, [r4], #8
BNE Loop_Begin
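For readers who want to check what the unrolled loop computes, the following is a minimal C sketch of the same arithmetic, not a replacement for the assembly. It assumes that each WMACS adds four signed 16-bit products to a 64-bit accumulator; the names mac4 and fir_dot_unrolled8 are hypothetical and do not appear in the guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one WMACS (assumed semantics): four signed 16-bit
 * products added to a 64-bit running sum. */
static int64_t mac4(int64_t acc, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Dot product of samples and Q15 taps, eight taps per loop iteration. */
int64_t fir_dot_unrolled8(const int16_t *samples, const int16_t *taps,
                          size_t taps_len)
{
    /* Unrolling restricts the tap count to a non-zero multiple of eight. */
    assert(taps_len >= 8 && taps_len % 8 == 0);

    int64_t acc = 0;
    for (size_t i = 0; i < taps_len; i += 8) {
        acc = mac4(acc, samples + i, taps + i);         /* first four taps  */
        acc = mac4(acc, samples + i + 4, taps + i + 4); /* second four taps */
    }
    return acc;
}
```

The loop counter is updated once per eight taps, which is the C-level counterpart of amortizing the SUBS/BNE overhead.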

4.4.1.1 General Remarks on Software Pipelining

In the example for the real block FIR filter, two copies of the basic sequence of code were
interleaved, eliminating all but one of the stalls. The throughput for the sequence went from
9 cycles for every four taps (2.25 cycles per tap) to 9 cycles for every eight taps (1.125 cycles
per tap), which represents a 2X throughput improvement.

It is useful to define a metric to describe the number of copies of a basic sequence of instructions
which need to be interleaved in order to remove all stalls. We can call this the interleave factor, k.
The real block FIR filter requires k=2 to eliminate all possible stalls, primarily because it is a
small sequence which must take into account the long load-to-use latency. In practice, k=2 is
sufficient for most loops encountered in real applications. This is fortunate because each
interleaving requires its own set of temporary registers, and with some algorithms interleaving with
k=3 is not possible. A good rule of thumb is to try k=2 first, as it is usually the right choice.
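The k=2 schedule can be sketched in C as two copies of a load/multiply-accumulate sequence, each with its own temporaries, arranged so that the loads for one copy are issued before the other copy's operands are consumed. This is only an illustration of the ordering, under the assumption that a C compiler would otherwise be free to reschedule it; the names mac4 and fir_dot_interleaved_k2, and the buffers a0, b0, a1, b1 (standing in for two register sets), are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for one multiply-accumulate of four signed 16-bit pairs. */
static int64_t mac4(int64_t acc, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Interleave factor k = 2: copy 0 uses a0/b0, copy 1 uses a1/b1. */
int64_t fir_dot_interleaved_k2(const int16_t *samples, const int16_t *taps,
                               size_t taps_len)
{
    assert(taps_len >= 8 && taps_len % 8 == 0);

    int16_t a0[4], b0[4];   /* temporaries owned by copy 0 */
    int16_t a1[4], b1[4];   /* temporaries owned by copy 1 */
    int64_t acc = 0;

    /* Prologue: issue copy 0's loads before entering the loop. */
    memcpy(a0, samples, sizeof a0);
    memcpy(b0, taps, sizeof b0);

    for (size_t i = 0; i < taps_len; i += 8) {
        /* Copy 1's loads are issued ahead of copy 0's use. */
        memcpy(a1, samples + i + 4, sizeof a1);
        memcpy(b1, taps + i + 4, sizeof b1);
        acc = mac4(acc, a0, b0);

        /* Reload copy 0 for the next pass only if there is one. */
        if (i + 8 < taps_len) {
            memcpy(a0, samples + i + 8, sizeof a0);
            memcpy(b0, taps + i + 8, sizeof b0);
        }
        acc = mac4(acc, a1, b1);
    }
    return acc;
}
```

Going to k=3 in this style would need a third pair of buffers, which is the register cost the text refers to.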

4.4.2 Multi-Sample Technique

The multi-sample optimization technique calculates multiple outputs with each loop iteration,
similar to loop unrolling. The disadvantages of applying this technique include increased code size
for critical loops and restrictions on the minimum number, and the permitted multiples, of taps or
samples. The obvious advantage is reduced cycle consumption.

• Memory bandwidth is reduced by data re-use.
• Load-to-use stalls may be easily eliminated with scheduling.
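A minimal C sketch of the data re-use idea follows, computing two outputs per pass over the taps. It assumes the filter is written as y[n] = sum over k of taps[k] * x[n + k], which may differ from the indexing used elsewhere in the guide; the function name fir_two_outputs_per_pass and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed form: y[n] = sum_k taps[k] * x[n + k].
 * x must hold at least n_out + taps_len - 1 samples. */
void fir_two_outputs_per_pass(const int16_t *x, const int16_t *taps,
                              size_t taps_len, int64_t *y, size_t n_out)
{
    /* Restriction imposed by the technique: an even number of outputs. */
    assert(taps_len > 0 && n_out % 2 == 0);

    for (size_t n = 0; n < n_out; n += 2) {
        int64_t acc0 = 0;           /* accumulates y[n]     */
        int64_t acc1 = 0;           /* accumulates y[n + 1] */
        int16_t x_cur = x[n];

        for (size_t k = 0; k < taps_len; k++) {
            int16_t t = taps[k];            /* one tap load, used twice   */
            int16_t x_next = x[n + k + 1];  /* one new sample load per tap */
            acc0 += (int32_t)t * x_cur;
            acc1 += (int32_t)t * x_next;
            x_cur = x_next;                 /* re-used by the next tap    */
        }
        y[n] = acc0;
        y[n + 1] = acc1;
    }
}
```

Each inner step performs two multiply-accumulates for two loads, where the one-output loop needs two loads per multiply-accumulate. The independent accumulators also leave room to schedule loads away from their uses.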

Intel® PXA27x Processor Family Optimization Guide
