Intel PXA270 Optimization Manual page 72

Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
Multi-cycle instructions may be interleaved with other instructions.
The C-code for the N-Sample, T-Tap block FIR filter is also used to illustrate the multi-sample
technique.
/* x[] must provide T-1 history samples before x[0]. */
for (i = 0; i < N; i += 4) {          /* four output samples per pass */
    s0 = s1 = s2 = s3 = 0;
    for (j = 0; j < T; j++) {         /* each a[j] is used for four outputs */
        s0 += a[j] * x[i - j];
        s1 += a[j] * x[i - j + 1];
        s2 += a[j] * x[i - j + 2];
        s3 += a[j] * x[i - j + 3];
    }
    y[i]     = round(s0);
    y[i + 1] = round(s1);
    y[i + 2] = round(s2);
    y[i + 3] = round(s3);
}

In the inner loop, four output samples are calculated from the adjacent data samples x(n-i),
x(n-i+1), x(n-i+2), and x(n-i+3), where n is the output index and i is the tap index (i and j,
respectively, in the C code). The output samples y(n), y(n+1), y(n+2), and y(n+3) are accumulated
in four 64-bit Intel® Wireless MMX™ Technology registers. To obtain near-ideal throughput, the
inner loop is unrolled to compute eight taps for each of the four output samples per loop iteration.

; ** Update pointers, zero accumulators and prime the loop with DWORD loads
Outer_Loop:
    WLDRD   wR0, [R1], #8          ; Load first 4 input samples
    WZERO   wR15
    WLDRD   wR1, [R1], #8          ; Load even groups of 4
                                   ; input samples
    WZERO   wR14
    WLDRD   wR8, [R2], #8          ; Load first 4 coefficients
    WZERO   wR13
    WZERO   wR12
InnerLoop:
; ** Executes 8 taps for each of the four output samples
;    y(n), y(n+1), y(n+2), y(n+3)
    SUBS    R0, R0, #8             ; Decrement loop counter
    WMAC    wR15, wR8, wR0         ; y(n)   +=
    WALIGNI wR3, wR1, wR0, #2
    WMAC    wR14, wR8, wR3         ; y(n+1) +=
    WALIGNI wR3, wR1, wR0, #4
    WMAC    wR13, wR8, wR3         ; y(n+2) +=
    WLDRD   wR0, [R1], #8          ; Next 4 input samples
    WALIGNI wR3, wR1, wR0, #6
    WLDRD   wR9, [R2], #8          ; Odd groups of 4 coeff.
    WMAC    wR12, wR8, wR3         ; y(n+3) +=
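As a hedged illustration (not code from this guide), the C sketch below contrasts a direct-form FIR, which produces one output per outer iteration, with the multi-sample form described above, which produces four. The function names fir_direct and fir_multisample, the int16_t samples with int32_t accumulators, the explicit n and t parameters, and the omission of rounding and saturation are all assumptions made for the example. The caller passes x pointing t-1 samples into its buffer so that x[i - j] stays in bounds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical direct-form reference: one output per outer iteration.
   x must point t-1 samples into its buffer (history at x[-(t-1)]..x[-1]).
   No rounding or saturation is applied. */
void fir_direct(const int16_t *x, const int16_t *a, int32_t *y, int n, int t)
{
    for (int i = 0; i < n; i++) {
        int32_t s = 0;
        for (int j = 0; j < t; j++)
            s += (int32_t)a[j] * x[i - j];
        y[i] = s;
    }
}

/* Hypothetical multi-sample form: four outputs per outer iteration.
   Each coefficient a[j] is read once and applied to four adjacent
   input samples, mirroring the s0..s3 accumulators in the C loop. */
void fir_multisample(const int16_t *x, const int16_t *a, int32_t *y, int n, int t)
{
    assert(n % 4 == 0);  /* outputs are produced in groups of four */
    for (int i = 0; i < n; i += 4) {
        int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int j = 0; j < t; j++) {
            int32_t c = a[j];  /* one coefficient load, four uses */
            s0 += c * x[i - j];
            s1 += c * x[i - j + 1];
            s2 += c * x[i - j + 2];
            s3 += c * x[i - j + 3];
        }
        y[i]     = s0;
        y[i + 1] = s1;
        y[i + 2] = s2;
        y[i + 3] = s3;
    }
}
```

Both functions compute the same outputs; the multi-sample version fetches each a[j] once per group of four outputs instead of once per output. The assembly loop exploits the same reuse by keeping four coefficients in wR8 across the four WMAC instructions that update wR15, wR14, wR13, and wR12.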
Intel® PXA27x Processor Family Optimization Guide
