Scheduling the WMUL and WMADD Instructions; SIMD Optimization Techniques; Software Pipelining - Intel PXA270 Optimization Manual

Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
4.3.2.4

Scheduling the WMUL and WMADD Instructions

The WMUL and WMADD instructions have an issue latency of one cycle and result and resource
latencies of two cycles each. In the following example, the second WMUL instruction stalls for
one cycle because of the two-cycle resource latency:
WMUL wR0, wR1, wR2
WMUL wR3, wR4, wR5
In the following example, the WADD instruction stalls for one cycle because it reads wR0, the
result of the preceding WMUL, which has a two-cycle result latency:
WMUL wR0, wR1, wR2
WADD wR1, wR0, wR2
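The two stall cases above can be reproduced with a small in-order issue model. The following C sketch is only an illustration of the latency rules stated in this section, not a description of the actual pipeline: the `Insn` record and the `count_stalls` function are hypothetical names, the one-cycle result latency used for non-multiply instructions is an assumption, and every other hazard is ignored.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_WREGS = 16 };   /* wR0..wR15 */

/* Hypothetical instruction record used only by this sketch. */
typedef struct {
    int is_mul;   /* 1 for WMUL/WMADD, 0 for any other instruction */
    int dst;      /* destination register index */
    int src1;     /* first source register index */
    int src2;     /* second source register index */
} Insn;

/* Count stall cycles for an in-order sequence, assuming:
 *   - one instruction can issue per cycle (issue latency 1),
 *   - WMUL/WMADD: result latency 2, resource latency 2 (from the text),
 *   - other instructions: result latency 1 (assumption of this sketch). */
int count_stalls(const Insn *seq, size_t n)
{
    int ready[NUM_WREGS] = {0};  /* cycle at which each register is readable */
    int mul_free = 0;            /* cycle at which the multiplier is free again */
    int next_issue = 0;          /* earliest issue cycle if nothing stalls */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        const Insn *in = &seq[i];
        assert(in->dst  >= 0 && in->dst  < NUM_WREGS);
        assert(in->src1 >= 0 && in->src1 < NUM_WREGS);
        assert(in->src2 >= 0 && in->src2 < NUM_WREGS);

        int t = next_issue;
        if (ready[in->src1] > t) t = ready[in->src1];    /* result latency */
        if (ready[in->src2] > t) t = ready[in->src2];
        if (in->is_mul && mul_free > t) t = mul_free;    /* resource latency */

        stalls += t - next_issue;

        if (in->is_mul) {
            ready[in->dst] = t + 2;
            mul_free = t + 2;
        } else {
            ready[in->dst] = t + 1;
        }
        next_issue = t + 1;
    }
    return stalls;
}
```

In this model, both two-instruction sequences above cost one stall cycle. Placing an independent non-multiply instruction between the WMUL and the dependent WADD brings the count to zero, which is the scheduling opportunity this section points at.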
4.4

SIMD Optimization Techniques

The Single Instruction, Multiple Data (SIMD) architecture provided by Intel® Wireless
MMX™ Technology makes it possible to exploit the inherent parallelism found in a wide range of
multimedia and communication applications. The most time-consuming code sequences in these
applications have certain characteristics in common:
• Operations are performed on small native data types (8-bit pixels, 16-bit voice, 32-bit audio)
• Memory access patterns are regular and recurring, and usually data independent
• Localized, recurring computations are performed on the data
• Processing is compute-intensive
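To make the first characteristic concrete, the sketch below emulates in portable C what a packed 16-bit add does: four independent 16-bit additions carried in one 64-bit value, with each lane wrapping on its own. The helper names `lane16` and `add4x16` are hypothetical and the wrap-around behavior is an assumption of this sketch; on SIMD hardware a single instruction performs the whole operation, so this loop is a model of the result, not of how such code is written.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit lane 0..3 from a 64-bit packed value. */
static uint16_t lane16(uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (uint16_t)(v >> (16 * lane));
}

/* Add four 16-bit lanes independently; a carry out of one lane
 * never propagates into the next lane. */
uint64_t add4x16(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t s = (uint16_t)(lane16(a, lane) + lane16(b, lane));
        r |= (uint64_t)s << (16 * lane);
    }
    return r;
}
```

Because the lanes do not interact, the same data-independent operation is applied to four samples at once, which is why regular access patterns and small data types suit this style of processing.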
The following sections illustrate how the rules for writing fast sequences of Intel® MMX™
Technology instructions on Intel® Wireless MMX™ Technology apply to the optimization of
short loops of Intel® MMX™ Technology code.
4.4.1

Software Pipelining

Software pipelining or loop unrolling is a well-known optimization technique in which multiple
calculations are executed in each loop iteration. The disadvantages of applying this technique
include increased code size for critical loops and restrictions on the minimum number of taps
or samples and on the multiples in which they must be supplied.
The obvious advantage is reduced cycle consumption: overhead from loop-exit testing may be
reduced, load-use stalls may be minimized and in some cases eliminated completely, and
instruction scheduling opportunities may be created and exploited.
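The trade-offs described above can be seen in a plain C sketch of loop unrolling applied to a 16-bit multiply-accumulate loop. The function names `mac_rolled` and `mac_unrolled4`, the unroll factor of four, and the 32-bit accumulators are assumptions chosen for illustration; the caller is assumed to keep the sum within 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one multiply-accumulate and one loop-exit test per sample. */
int32_t mac_rolled(const int16_t *c, const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)c[i] * x[i];
    return acc;
}

/* Unrolled by four: one loop-exit test per four samples, and four
 * independent accumulators so the products do not depend on each other.
 * The price is larger code and the restriction that n be a multiple of 4. */
int32_t mac_unrolled4(const int16_t *c, const int16_t *x, size_t n)
{
    assert(n % 4 == 0);
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += (int32_t)c[i]     * x[i];
        acc1 += (int32_t)c[i + 1] * x[i + 1];
        acc2 += (int32_t)c[i + 2] * x[i + 2];
        acc3 += (int32_t)c[i + 3] * x[i + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

The independent accumulators give a scheduler room to place other work between a multiply and the instruction that consumes its result, while the `n % 4` precondition is an instance of the restriction on multiples of taps or samples.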
To illustrate the need for software pipelining, consider a key kernel of Intel® MMX™
Technology code that is central to many signal-processing algorithms: the real block
Finite-Impulse-Response (FIR) filter. A real block FIR filter operates on two real vectors,
c(i) and x(i), and produces an output vector y(n). The vectors are represented for Intel® MMX™
Technology programming as arrays of 16-bit integers of some length N. The real FIR filter is
represented by the equation:
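In its commonly used textbook form, and using the symbols c, x, y, and N from the paragraph above, a real FIR filter can be written as follows. The tap count T and the index ranges are assumptions introduced for illustration and may differ from the guide's own notation.

```latex
y(n) = \sum_{i=0}^{T-1} c(i)\, x(n - i), \qquad 0 \le n < N
```

Each output sample is a sum of T products of 16-bit values, which is the multiply-accumulate pattern that the WMUL and WMADD scheduling rules apply to.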
Intel® PXA27x Processor Family Optimization Guide