Intel PXA270 Optimization Manual page 70

Pxa27x processor family

Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
$$ y(n) = \sum_{i=0}^{L-1} c_i \, x(n-i), \quad 0 \le n \le N-1 $$
or, in C code (a[j] holds the coefficients c_j, and T is the number of taps, written L above),
for (i = 0; i < N; i++) {
    s = 0;
    for (j = 0; j < T; j++) {
        s += a[j]*x[i-j];
    }
    y[i] = round(s);
}
The WMAC instruction performs this calculation: it carries out four parallel 16-bit by 16-bit
multiplications and accumulates the products. The first level of unrolling follows directly from
the four-way SIMD instruction that is used to implement the filter.
The C code for the real block FIR filter is rewritten here to show that four taps are computed in
each inner-loop iteration (T is assumed to be a multiple of four).
for (i = 0; i < N; i++) {
    s0 = 0;
    for (j = 0; j < T; j += 4) {
        s0 += a[j]*x[i-j];
        s0 += a[j+1]*x[i-j-1];
        s0 += a[j+2]*x[i-j-2];
        s0 += a[j+3]*x[i-j-3];
    }
    y[i] = round(s0);
}
A direct assembly implementation of the inner loop is clearly not optimal. The following code
sequence contains several undesirable stalls: the back-to-back WLDRD instructions incur a
1-cycle stall, and the load-to-use penalty incurs a 3-cycle stall. In addition, the loop overhead
is high, with 2 cycles consumed for every four taps.
; Pointers r0 -> val, r1 -> pResult, r2 -> pTapsQ15, r3 -> tapsLen
WZERO wR15
Loop_Begin:
WLDRD wR0, [r2], #8
WLDRD wR1, [r4], #8
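The four-way multiply-accumulate described above can be modeled in portable C to check that the unrolled loop produces the same sum as the scalar loop. This is only a sketch under stated assumptions: `wmac4`, `fir_ref`, and `fir_unrolled` are hypothetical names introduced for illustration and are not part of any Intel API. The accumulator is modeled as a 64-bit integer, and the caller is assumed to provide T-1 samples of history before x[0] so that x[n-j] is always valid.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a four-lane signed 16x16 multiply-accumulate:
   adds the four lane products to the running accumulator. */
static int64_t wmac4(int64_t acc, const int16_t a[4], const int16_t b[4])
{
    for (int lane = 0; lane < 4; lane++) {
        acc += (int32_t)a[lane] * (int32_t)b[lane];
    }
    return acc;
}

/* Scalar reference: one tap per iteration, s = sum over j of c[j] * x[n - j]. */
static int64_t fir_ref(const int16_t *c, const int16_t *x, int n, int T)
{
    int64_t s = 0;
    for (int j = 0; j < T; j++) {
        s += (int32_t)c[j] * (int32_t)x[n - j];
    }
    return s;
}

/* Unrolled version: four taps per iteration, T must be a multiple of four. */
static int64_t fir_unrolled(const int16_t *c, const int16_t *x, int n, int T)
{
    assert(T % 4 == 0);
    int64_t acc = 0;
    for (int j = 0; j < T; j += 4) {
        /* Gather the four samples that pair with c[j] .. c[j+3]. */
        const int16_t xs[4] = { x[n - j], x[n - j - 1], x[n - j - 2], x[n - j - 3] };
        acc = wmac4(acc, &c[j], xs);
    }
    return acc;
}
```

Because the sample order runs backward relative to the coefficients, this sketch gathers the samples into a small temporary array. An implementation that loads both operands with forward post-increment addressing would presumably store one of the two arrays in reversed order instead; that is an assumption, not something the text states.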
Intel® PXA27x Processor Family Optimization Guide


This manual is also suitable for:

PXA271, PXA272, PXA273
