Case Study 5: 8X8 Block 1/2X Motion Compensation - Intel PXA270 Optimization Manual

Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
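ldr r5, [r0], r4              ; r0 = pSrc
ldr r11, [r0], r4
ldr r8, [r0], r4
ldr r12, [r0], r4
; These loads are scheduled to distinct destination registers
and r6, r5, r9                ; r6->tmp = tmp0 & 0xffff;
orr r6, r6, r11, lsl #16      ; r6->tmp |= tmp1 << 16;
and r11, r11, r9, lsl #16     ; r11->tmp1 &= 0xffff0000;
and r7, r8, r9                ; r7->tmp = tmp0 & 0xffff;
orr r11, r11, r5, lsr #16     ; r11->tmp1 |= tmp0 >> 16;
orr r7, r7, r12, lsl #16      ; r7->tmp |= tmp1 << 16;
str r6, [r1], #4              ; Write Coalesce the two stores
str r7, [r1], #4
and r12, r12, r9, lsl #16     ; r12->tmp1 &= 0xffff0000;
orr r12, r12, r8, lsr #16     ; r12->tmp1 |= tmp0 >> 16;
str r11, [r10], #4            ; Write Coalesce the two stores
str r12, [r10], #4
subs r14, r14, #1
bgt LOOP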
In the following example, the scheduled instructions take advantage of write coalescing of multiple
store instructions to the same line. The two stores are combined in a single write-buffer entry and
issued as a single write request.
str r11, [r10], #4            ; Write Coalesce the two stores
str r12, [r10], #4
Write coalescing can be exploited either by unrolling the C loop or by explicitly inlining multiple
stores that can be combined.
The register rotation technique also allows multiple loads to be outstanding.
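As a rough illustration of the C-level form of this technique, the sketch below restates one pass of the scheduled loop in C, following the C fragments given in the assembly comments. The function name, parameter names, and the use of a stride counted in 32-bit words (the assembly uses a byte stride in r4) are assumptions made for this example, not part of the original listing. Whether a given compiler emits the adjacent stores back to back is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical C rendering of the loop body (all names are illustrative).
 * Each iteration loads four words into distinct variables, then issues
 * two adjacent stores per destination so they can be write-coalesced.
 */
static void interleave_halfwords(const uint32_t *src, size_t stride_words,
                                 uint32_t *dst_lo, uint32_t *dst_hi,
                                 size_t iterations)
{
    assert(src != NULL && dst_lo != NULL && dst_hi != NULL);

    for (size_t i = 0; i < iterations; i++) {
        /* Four loads into distinct variables, like r5, r11, r8, r12. */
        uint32_t a0 = src[0];
        uint32_t a1 = src[stride_words];
        uint32_t b0 = src[2 * stride_words];
        uint32_t b1 = src[3 * stride_words];
        src += 4 * stride_words;

        /* tmp = tmp0 & 0xffff; tmp |= tmp1 << 16; (two adjacent stores) */
        dst_lo[0] = (a0 & 0xffffu) | (a1 << 16);
        dst_lo[1] = (b0 & 0xffffu) | (b1 << 16);

        /* tmp1 &= 0xffff0000; tmp1 |= tmp0 >> 16; (two adjacent stores) */
        dst_hi[0] = (a1 & 0xffff0000u) | (a0 >> 16);
        dst_hi[1] = (b1 & 0xffff0000u) | (b0 >> 16);

        dst_lo += 2;
        dst_hi += 2;
    }
}
```

Unrolling by two, as here, is what places two stores to consecutive addresses next to each other for each destination pointer.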
4.6.5 Case Study 5: 8x8 Block 1/2X Motion Compensation

Bi-linear interpolation is a typical operation in image and video processing applications. For
example, motion compensation in video decoding uses the 1/2X interpolation operation.
Intel® Wireless MMX™ Technology features can help accelerate these key applications, and the
following code demonstrates how. The key points for optimizing the 1/2X motion compensation
are:
• Use the WALIGNR instruction to align the packed byte array.
• Use the WAVG2BR instruction to calculate the average of bytes.
• Schedule around the load-to-use latency.
This example code is for the 1/2X interpolation:
; Test for special case of aligned (LSBs = 110b and 000b)
; r0 -> pointer to misaligned array.
MOV r5, #7                    ; r5 = 0x7
AND r7, r0, r5                ; r7 -> 3 LSBs of *psrc
MOV r12, #4                   ; counter
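To make the roles of WALIGNR and WAVG2BR concrete, the following is a minimal software model in C of one 8-pixel row, written as a sketch rather than as the manual's implementation. It assumes a horizontal 1/2X interpolation, byte 0 held in the least significant lane of each 64-bit value, and a rounded average of (a + b + 1) >> 1 per byte. All function names are hypothetical. Unlike the instruction, which takes a 3-bit offset, the alignment helper here also accepts an offset of 8 so that the neighbor of the last pixel can be fetched without a special case.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors MOV r5,#7 / AND r7,r0,r5: the three LSBs of the source address. */
static unsigned byte_offset(uintptr_t addr)
{
    return (unsigned)(addr & 7u);
}

/* Model of the alignment step: view hi:lo as 16 bytes and extract the
 * 8 bytes starting at `offset` (0..8 in this sketch). */
static uint64_t align_bytes(uint64_t lo, uint64_t hi, unsigned offset)
{
    assert(offset <= 8u);
    if (offset == 0u) return lo;   /* avoid a shift by 64 */
    if (offset == 8u) return hi;
    return (lo >> (8u * offset)) | (hi << (64u - 8u * offset));
}

/* Model of the averaging step: unsigned per-byte average with rounding. */
static uint64_t avg2_bytes_round(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 8u; i++) {
        unsigned x = (unsigned)(a >> (8u * i)) & 0xffu;
        unsigned y = (unsigned)(b >> (8u * i)) & 0xffu;
        result |= (uint64_t)((x + y + 1u) >> 1) << (8u * i);
    }
    return result;
}

/* One row of horizontal half-pel interpolation: average each pixel
 * with its right-hand neighbor. */
static uint64_t half_pel_row(uint64_t lo, uint64_t hi, unsigned offset)
{
    assert(offset <= 7u);
    uint64_t left  = align_bytes(lo, hi, offset);
    uint64_t right = align_bytes(lo, hi, offset + 1u);
    return avg2_bytes_round(left, right);
}
```

In use, lo and hi would be the two aligned 64-bit words that straddle the misaligned row, and the offset would come from byte_offset applied to the source pointer.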
Intel® PXA27x Processor Family Optimization Guide

This manual is also suitable for:

PXA271, PXA272, PXA273
