Behavioral Description; Perils Of Superpipelining - Intel PXA270 Optimization Manual

PXA27x Processor Family


Microarchitecture Overview

These are important characteristics of the MAC:

• The MAC is not a true pipeline. Processing a single instruction requires the same data-path resources for several cycles before a new instruction is accepted. The type of instruction and its source arguments determine the number of cycles required.

• No more than two instructions can occupy the MAC pipeline concurrently.

• When the MAC is processing an instruction, another instruction cannot enter M1 unless the original instruction completes in the next cycle.

• The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and memory traffic. Two 16-bit data items can be loaded into a register with one LDR.

• The MAC can achieve a throughput of one multiply per cycle when performing a 16-by-32-bit multiply.

• ACC registers in the Intel XScale® Microarchitecture can be up to 64 bits in future implementations. Code should be written to depend on the 40-bit nature of the current implementation.
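The packed 16-bit point can be illustrated in portable C. The sketch below is not XScale-specific code and does not use the MAC instructions themselves; it only models the idea that one 32-bit word carries two signed 16-bit items, which are then multiplied and summed into a wide accumulator. The function names (`pack16`, `unpack_lo`, `unpack_hi`, `dot_packed16`), the low-half-first packing order, and the use of `int64_t` as a stand-in for the accumulator are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend a 16-bit field (0..65535) without relying on
 * implementation-defined narrowing conversions. */
int16_t sign_extend16(uint32_t v)
{
    int32_t x = (int32_t)(v & 0xFFFFu);
    return (int16_t)((x ^ 0x8000) - 0x8000);
}

/* Low half of a packed word, read as a signed 16-bit item. */
int16_t unpack_lo(uint32_t w) { return sign_extend16(w); }

/* High half of a packed word, read as a signed 16-bit item. */
int16_t unpack_hi(uint32_t w) { return sign_extend16(w >> 16); }

/* Pack two signed 16-bit items into one 32-bit word.
 * Assumption: the first item goes in the low half. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Dot product over n packed words, i.e. 2*n items per array.
 * Each loop iteration reads one word per array (one load each)
 * and performs two multiply-accumulate steps.
 * int64_t stands in for the accumulator; its real width is not modeled. */
int64_t dot_packed16(const uint32_t *a, const uint32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t wa = a[i];
        uint32_t wb = b[i];
        acc += (int32_t)unpack_lo(wa) * (int32_t)unpack_lo(wb);
        acc += (int32_t)unpack_hi(wa) * (int32_t)unpack_hi(wb);
    }
    return acc;
}
```

For example, packing the items (1, -2, 3, 4) and (5, 6, -7, 8) into two words each and calling `dot_packed16` on them yields 1·5 + (-2)·6 + 3·(-7) + 4·8. Half as many loads are needed as with one item per word, which is the memory-traffic saving the list describes.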

2.2.5.1 Behavioral Description

The execution of the MAC unit starts at the beginning of the M1 pipestage. At this point, the MAC unit receives two 32-bit source operands. Results are completed N cycles later (where N depends on the operand size) and returned to the register file. For more information on MAC instruction latencies, refer to Section 4.8, "Instruction Latencies for Intel XScale® Microarchitecture".

An instruction occupying the M1 or M2 pipestage also occupies the X1 or X2 pipestage, respectively. Each cycle, a MAC operation progresses from M1 toward M5. A MAC operation may complete in any pipestage from M2 to M5.

2.2.5.2 Perils of Superpipelining

The longer pipeline has several consequences worth considering:

• Larger branch misprediction penalty (four cycles in the Intel XScale® Microarchitecture instead of one in StrongARM* Architecture).

• Larger load use delay (LUD): LUDs arise from load-use dependencies. A load-use dependency gives rise to a LUD if the result of the load instruction cannot be made available by the pipeline in time for the subsequent instruction. To avoid these penalties, an optimizing compiler should take advantage of the core's multiple outstanding load capability (also called hit-under-miss) and find independent instructions to fill the slot following the load.

• Certain instructions (LDM, STM) incur a few extra cycles of delay on the Intel XScale® Microarchitecture compared with StrongARM* processors.

• Decode and register file lookups are spread over two cycles in the Intel XScale® Microarchitecture, instead of one cycle in its predecessors.
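The load-use delay item can be made concrete with a source-level sketch. The two C functions below compute the same result; the second is written so that the load for the next iteration is issued before the current value is consumed, which gives a scheduler an independent instruction to place after each load. This shows only the shape of the transformation: the function names are hypothetical, and a compiler is free to reorder either loop, so neither the generated instruction order nor any cycle saving is guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Direct form: each loaded value is used by the very next operation,
 * so the multiply-add depends on the load just before it. */
int32_t sum_scaled_naive(const int32_t *src, size_t n, int32_t k)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t x = src[i];   /* load */
        acc += x * k;         /* immediate use of the loaded value */
    }
    return acc;
}

/* Software-pipelined form: the load of element i is issued while
 * element i-1 (loaded one iteration earlier) is being consumed. */
int32_t sum_scaled_pipelined(const int32_t *src, size_t n, int32_t k)
{
    if (n == 0) {
        return 0;
    }
    int32_t acc = 0;
    int32_t cur = src[0];            /* prologue: first load */
    for (size_t i = 1; i < n; i++) {
        int32_t next = src[i];       /* load, independent of cur */
        acc += cur * k;              /* use the value loaded earlier */
        cur = next;
    }
    acc += cur * k;                  /* epilogue: last element */
    return acc;
}
```

Both functions return the same value for the same inputs; only the distance between each load and its first use differs.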


Intel® PXA27x Processor Family Optimization Guide
