Intel PXA270 Optimization Manual page 70

PXA27x Processor Family


Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization

$$y[i] = \sum_{j=0}^{T-1} a[j] \cdot x[i-j], \qquad 0 \le i \le N-1$$

or, in C-code,

    for (i = 0; i < N; i++) {
        s = 0;
        for (j = 0; j < T; j++) {
            s += a[j]*x[i-j];
        }
        y[i] = round (s);
    }

The WMAC instruction is used for this calculation and provides four parallel 16-bit by 16-bit multiplications with accumulation. The first level of unrolling is a direct function of the four-way SIMD instruction that is used to implement the filter.

The C-code for the real block FIR filter is rewritten to illustrate that four taps are computed for each loop iteration.

    for (i = 0; i < N; i++) {
        s0 = 0;
        /* T is assumed to be a multiple of 4 */
        for (j = 0; j < T; j += 4) {
            s0 += a[j]*x[i-j];
            s0 += a[j+1]*x[i-j-1];
            s0 += a[j+2]*x[i-j-2];
            s0 += a[j+3]*x[i-j-3];
        }
        y[i] = round (s0);
    }

The direct assembly code implementation of the inner loop clearly shows that execution is not yet optimal. The following code sequence has several undesirable stalls: the back-to-back WLDRD instructions incur a 1-cycle stall, and the load-to-use penalty incurs a 3-cycle stall. In addition, the loop overhead is high, with 2 cycles consumed for every four taps.

    ; Pointers r0 -> val, r1 -> pResult, r2 -> pTapsQ15, r3 -> tapsLen
        WZERO wR15
    Loop_Begin:
        WLDRD wR0, [r2], #8
        WLDRD wR1, [r4], #8

Intel® PXA27x Processor Family Optimization Guide
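The four-tap unrolling described above can be checked with a small portable C model. This is only a sketch under stated assumptions, not Wireless MMX code: the function names `fir_acc_ref` and `fir_acc_unrolled4` are hypothetical, the inputs are assumed to be 16-bit samples and taps, the accumulator is modelled as a 64-bit integer, and the final `round` step is left out so that the two accumulations can be compared directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulator for one output sample, one tap per iteration:
 * s = sum over j of a[j] * x[i-j].
 * The caller must guarantee that x[i-T+1] .. x[i] are valid. */
int64_t fir_acc_ref(const int16_t *a, const int16_t *x,
                    ptrdiff_t i, ptrdiff_t T)
{
    int64_t s = 0;
    for (ptrdiff_t j = 0; j < T; j++) {
        s += (int32_t)a[j] * (int32_t)x[i - j];
    }
    return s;
}

/* Same accumulation with four taps per iteration, mirroring the
 * four 16-bit by 16-bit multiply-accumulates that a four-way SIMD
 * instruction performs in one pass.
 * Assumption: T is a multiple of 4. */
int64_t fir_acc_unrolled4(const int16_t *a, const int16_t *x,
                          ptrdiff_t i, ptrdiff_t T)
{
    assert(T % 4 == 0);
    int64_t s0 = 0;
    for (ptrdiff_t j = 0; j < T; j += 4) {
        s0 += (int32_t)a[j]     * (int32_t)x[i - j];
        s0 += (int32_t)a[j + 1] * (int32_t)x[i - j - 1];
        s0 += (int32_t)a[j + 2] * (int32_t)x[i - j - 2];
        s0 += (int32_t)a[j + 3] * (int32_t)x[i - j - 3];
    }
    return s0;
}
```

Both functions should return the same value for any tap count that is a multiple of four. The unrolled form only regroups the additions, which is what lets each group of four products map onto a single four-way multiply-accumulate.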
