General Remarks On Multi-Sample Technique; Data Alignment Techniques - Intel PXA270 Optimization Manual


Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
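    WALIGNI wR3, wR0, wR1, #2
    WALIGNI wR4, wR0, wR1, #4
    WMAC    wR15, wR9, wR1         ; y(n) +=
    WALIGNI wR5, wR0, wR1, #6
    WMAC    wR14, wR9, wR3         ; y(n+1) +=
    WLDRD   wR1, [R1], #8          ; even groups of 4 inputs
    WMAC    wR13, wR9, wR4         ; y(n+2) +=
    WLDRD   wR8, [R2], #8          ; even groups of 4 coeff.
    WMAC    wR12, wR8, wR5         ; y(n+3) +=
    BNE     Inner_Loop
    ; ** Outer loop code calculates the last four taps for
    ;    y(n), y(n+1), y(n+2), y(n+3) **
    ; ** Store results **
    BNE     Outer_Loop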
4.4.2.1 General Remarks on Multi-Sample Technique

In the example for the real block FIR filter, four outputs are computed simultaneously in the same
inner loop. This allows the coefficients and sample data loaded into registers for the first output
to be re-used for the computation of the next three outputs. The interleave factor is set at k=2,
which eliminates load-to-use stalls. The throughput for the sequence is 20 cycles for every 32 taps,
or 0.625 cycles per tap. This represents near-ideal saturation of the execution resources.
The multi-sample technique may be applied whenever the same data is used for multiple
calculations. The large register file of Intel® Wireless MMX™ Technology facilitates this
approach, and a number of variations are possible.
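The re-use described above can be illustrated with a scalar C model. This is only a sketch under stated assumptions, not the manual's implementation: the function name `fir_block4`, the convention y[n] = sum over k of c[k] * x[n + k], and the requirement that the output count be a multiple of four are all hypothetical choices made for the example. It does not model the packed WMAC arithmetic or the k=2 interleave.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the multi-sample technique: four outputs y[n..n+3] are
 * accumulated in the same inner loop, so each coefficient c[k] is read
 * once and used four times.
 * Assumptions (not from the manual): x holds n_out + n_taps - 1 samples,
 * n_out is a multiple of 4, and y[n] = sum_k c[k] * x[n + k]. */
void fir_block4(const int16_t *x, const int16_t *c, int32_t *y,
                size_t n_out, size_t n_taps)
{
    for (size_t n = 0; n + 3 < n_out; n += 4) {
        int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        for (size_t k = 0; k < n_taps; k++) {
            int32_t ck = c[k];              /* read once per tap */
            acc0 += ck * x[n + k];          /* y(n)   += */
            acc1 += ck * x[n + k + 1];      /* y(n+1) += */
            acc2 += ck * x[n + k + 2];      /* y(n+2) += */
            acc3 += ck * x[n + k + 3];      /* y(n+3) += */
        }
        y[n]     = acc0;
        y[n + 1] = acc1;
        y[n + 2] = acc2;
        y[n + 3] = acc3;
    }
}
```

Each sample x[n + k + 1] read in one iteration is the same value needed as x[n + (k + 1)] in the next, which is the sample re-use that the register-based version exploits.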
4.4.3 Data Alignment Techniques

The data parallelism present in multimedia algorithms is exploited by executing the same
operation on different elements in parallel. This is accomplished by packing several data
elements into a single register and using the packed data instructions provided by the
Intel® Wireless MMX™ Technology.
An important guideline for achieving optimum performance is always to align memory references.
This means that an N-byte memory read or write should always be on an N-byte boundary. In some
cases it is easy to align data so that all of the reads and writes are aligned. In other cases it is
more difficult because an algorithm naturally reads data in a misaligned fashion. Two examples
of this are the single-sample FIR and video motion estimation.
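The N-byte boundary rule can be checked with simple address arithmetic. The helpers below are a minimal sketch; the names `is_aligned`, `align_down`, and `misalignment` are hypothetical and assume that N is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when addr lies on an n-byte boundary.
 * Assumption: n is a power of two (1, 2, 4, 8, ...). */
int is_aligned(uintptr_t addr, size_t n)
{
    return (addr & (uintptr_t)(n - 1)) == 0;
}

/* Rounds addr down to the previous n-byte boundary. */
uintptr_t align_down(uintptr_t addr, size_t n)
{
    return addr & ~(uintptr_t)(n - 1);
}

/* Byte offset of addr from the previous n-byte boundary. */
size_t misalignment(uintptr_t addr, size_t n)
{
    return (size_t)(addr & (uintptr_t)(n - 1));
}
```

For an 8-byte access at address 0x1006, for example, the aligned base is 0x1000 and the offset is 6 bytes, so the data spans two aligned 8-byte words.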
The Intel® Wireless MMX™ Technology provides a mechanism for reducing the overhead
associated with the classes of algorithms which require data to be accessed on 32-bit, 16-bit, or
8-bit boundaries. The WALIGNI instruction is useful when the sequence of alignments is known
beforehand, as with the single-sample FIR filter. The WALIGNR instruction is useful when the
sequence of alignments is calculated as the algorithm executes, as with the fast motion search
algorithms used in video compression. Both of these instructions operate on register pairs, which
may be effectively ping-ponged with alternate loads, reducing the alignment overhead significantly.
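The register-pair extraction and the ping-pong pattern can be modelled in portable C. This is a hedged sketch rather than a definition of the instructions: `align_extract` and `read_misaligned` are hypothetical names, and the model assumes the pair is treated as a 16-byte little-endian value from which 8 bytes are taken at a byte offset of 0 to 7.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of extracting 8 bytes from a register pair. lo holds the
 * lower-addressed 8 bytes and hi the next 8; off is the byte offset (0..7).
 * Assumption: this mirrors the alignment instructions only approximately. */
uint64_t align_extract(uint64_t lo, uint64_t hi, unsigned off)
{
    off &= 7u;
    if (off == 0)
        return lo;                 /* avoid an undefined 64-bit shift */
    return (lo >> (8u * off)) | (hi << (64u - 8u * off));
}

/* Produces n_words misaligned 64-bit words using only aligned reads.
 * Each aligned word is read once; the (lo, hi) pair is ping-ponged so the
 * previous hi becomes the next lo.
 * Assumption: aligned[] holds at least n_words + 1 entries. */
void read_misaligned(const uint64_t *aligned, unsigned off,
                     uint64_t *out, size_t n_words)
{
    uint64_t lo = aligned[0];
    for (size_t i = 0; i < n_words; i++) {
        uint64_t hi = aligned[i + 1];          /* one new aligned load */
        out[i] = align_extract(lo, hi, off);
        lo = hi;                               /* ping-pong the pair */
    }
}
```

With a fixed offset known in advance this corresponds to the immediate form; when the offset is computed at run time, the same extraction is driven by a variable, which is the case the text associates with motion search.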
Intel® PXA27x Processor Family Optimization Guide