Behavioral Description; Perils Of Superpipelining - Intel PXA270 Optimization Manual

Microarchitecture Overview
These are important characteristics of the MAC:
- The MAC is not a true pipeline. Processing a single instruction requires the same data-path resources for several cycles before a new instruction can be accepted. The type of instruction and its source arguments determine the number of cycles required.
- No more than two instructions can concurrently occupy the MAC pipeline.
- When the MAC is processing an instruction, another instruction cannot enter M1 unless the original instruction completes in the next cycle.
- The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and memory traffic. Two 16-bit data items can be loaded into a register with one LDR.
- The MAC can achieve a throughput of one multiply per cycle when performing a 16-by-32-bit multiply.
- ACC registers in the Intel XScale® Microarchitecture can be up to 64 bits in future implementations. Code should not be written to depend on the 40-bit nature of the current implementation.
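The packed 16-bit layout described above can be illustrated with a minimal C sketch. It only models the data layout and the arithmetic: each 32-bit word carries two signed 16-bit items, so one word load fetches both. The helper names `pack16` and `dot_packed16` are hypothetical, the low-half/high-half ordering is an assumption, and nothing here claims that a given compiler will map this code onto the MAC instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack two signed 16-bit items into one 32-bit word.
   Assumed layout: 'lo' in bits 15..0, 'hi' in bits 31..16. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Hypothetical helper: dot product over two arrays of packed 16-bit pairs.
   The accumulator is 64 bits wide here, so the result does not rely on the
   40-bit accumulator width mentioned in the text. */
static int64_t dot_packed16(const uint32_t *a, const uint32_t *b, size_t words)
{
    int64_t acc = 0;

    for (size_t i = 0; i < words; i++) {
        uint32_t wa = a[i];  /* one 32-bit load brings in two items */
        uint32_t wb = b[i];

        /* Narrowing to int16_t recovers the sign (two's complement
           conversion, as on all common compilers). */
        int16_t a_lo = (int16_t)(wa & 0xFFFFu);
        int16_t a_hi = (int16_t)(wa >> 16);
        int16_t b_lo = (int16_t)(wb & 0xFFFFu);
        int16_t b_hi = (int16_t)(wb >> 16);

        acc += (int32_t)a_lo * b_lo;  /* 16x16 multiply-accumulate */
        acc += (int32_t)a_hi * b_hi;
    }
    return acc;
}
```

With this layout an array of n 16-bit items occupies n/2 words, which is where the reduction in load count and register use comes from.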
2.2.5.1 Behavioral Description

The execution of the MAC unit starts at the beginning of the M1 pipestage. At this point, the MAC
unit receives two 32-bit source operands. Results are completed N cycles later (where N depends
on the operand size) and returned to the register file. For more information on MAC instruction
latencies, refer to Section 4.8, "Instruction Latencies for Intel XScale® Microarchitecture".
An instruction occupying the M1 or M2 pipestage also occupies the X1 or X2 pipestage, respectively.
Each cycle, a MAC operation progresses one stage from M1 toward M5. A MAC operation may complete
in any pipestage from M2 through M5.
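The occupancy rules for the MAC can be made concrete with a toy scheduling model. This is a sketch under stated assumptions, not a cycle-accurate description of the hardware. It assumes that an instruction entering M1 in cycle s and completing in pipestage Mk finishes in cycle s + k - 1, that a following instruction may enter M1 in cycle t only if its predecessor completes no later than cycle t + 1, and that the instruction two positions earlier must already have left the MAC. The function name `mac_schedule` and its parameters are hypothetical.

```c
#include <stddef.h>

/*
 * Toy occupancy model of the MAC (NOT cycle-accurate; each rule is one
 * reading of the prose, not a documented algorithm).
 *
 * done_stage[i]: pipestage (2..5, meaning M2..M5) in which instruction i
 *                completes.
 * enter_m1[i]:   output, earliest cycle in which instruction i enters M1.
 * Returns the number of cycles until the last instruction has completed.
 */
static int mac_schedule(const int *done_stage, size_t n, int *enter_m1)
{
    int prev_enter = -1;  /* M1 entry cycle of instruction i-1 */
    int prev_done = -1;   /* completion cycle of instruction i-1 */
    int prev2_done = -1;  /* completion cycle of instruction i-2 */

    for (size_t i = 0; i < n; i++) {
        int t = prev_enter + 1;   /* at most one M1 entry per cycle */

        if (t < prev_done - 1)    /* predecessor must complete by cycle t+1 */
            t = prev_done - 1;
        if (t < prev2_done + 1)   /* no more than two instructions in the MAC */
            t = prev2_done + 1;

        enter_m1[i] = t;
        prev2_done = prev_done;
        prev_enter = t;
        prev_done = t + done_stage[i] - 1;
    }
    return prev_done + 1;         /* cycles are numbered from 0 */
}
```

Under these assumptions, back-to-back instructions that complete in M2 enter M1 on consecutive cycles, which matches the one-multiply-per-cycle throughput quoted for the 16-by-32-bit case, while instructions that complete in M5 are spaced three cycles apart.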
2.2.5.2 Perils of Superpipelining

The longer pipeline has several consequences worth considering:
- Larger branch misprediction penalty (four cycles in the Intel XScale® Microarchitecture instead of one in StrongARM* Architecture).
- Larger load use delay (LUD). LUDs arise from load-use dependencies. A load-use dependency gives rise to a LUD if the result of the load instruction cannot be made available by the pipeline in time for the subsequent instruction. To avoid these penalties, an optimizing compiler should take advantage of the core's multiple outstanding load capability (also called hit-under-miss) and find independent instructions to fill the slot following the load.
- Certain instructions (LDM, STM) incur a few extra cycles of delay with the Intel XScale® Microarchitecture compared to StrongARM* processors.
- Decode and register file lookups are spread out over two cycles with the Intel XScale® Microarchitecture, instead of one cycle in its predecessors.
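The advice on load-use delays can be sketched at source level. The two functions below compute the same result; the second is written so that the load for the next iteration is issued before the current value is consumed, leaving independent work between each load and its use. The function names are hypothetical, and an optimizing compiler may already perform this reordering on its own, so the sketch shows the scheduling idea rather than a guaranteed speedup.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive form: each loaded value is consumed by the very next operation,
   so the multiply-add waits on the load (a load-use dependency). */
static int32_t sum_scaled_naive(const int32_t *p, size_t n, int32_t k)
{
    int32_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        int32_t x = p[i];  /* load */
        acc += x * k;      /* immediate use of the loaded value */
    }
    return acc;
}

/* Scheduled form: the next element is loaded before the current one is
   used, so the multiply-add on 'cur' can fill the slot after the load. */
static int32_t sum_scaled_scheduled(const int32_t *p, size_t n, int32_t k)
{
    int32_t acc = 0;

    if (n == 0)
        return 0;

    int32_t cur = p[0];            /* first load, ahead of the loop */
    for (size_t i = 1; i < n; i++) {
        int32_t next = p[i];       /* issue the next load early */
        acc += cur * k;            /* independent of 'next' */
        cur = next;
    }
    acc += cur * k;                /* consume the final element */
    return acc;
}
```

The scheduled form keeps one value in flight at all times, which is the same idea a compiler applies when it hoists loads away from their consumers.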
Intel® PXA27x Processor Family Optimization Guide
