Scheduling Data Processing Instructions - Intel XScale Core Developer's Manual

Intel XScale® Core Developer's Manual
Optimization Guide
A.5.2

Scheduling Data Processing Instructions

Most core data processing instructions have a result latency of 1 cycle. This means that the current
instruction is able to use the result from the previous data processing instruction. However, the
result latency is 2 cycles if the current instruction needs to use the result of the previous data
processing instruction for a shift by immediate. As a result, the following code segment would
incur a 1 cycle stall for the mov instruction:

    sub    r6, r7, r8
    add    r1, r2, r3
    mov    r4, r1, LSL #2

The code above can be rearranged as follows to remove the 1 cycle stall:

    add    r1, r2, r3
    sub    r6, r7, r8
    mov    r4, r1, LSL #2

All data processing instructions incur a 2 cycle issue penalty and a 2 cycle result penalty when the
shifter operand is a shift/rotate by a register or the shifter operand is RRX. Because the 2 cycle
issue penalty always delays the next instruction, the stall cannot be hidden by reordering; the only
way to avoid it is to re-write the assembler instruction. Consider the following segment of code:

    mov    r3, #10
    mul    r4, r2, r3
    add    r5, r6, r2, LSL r3
    sub    r7, r8, r2

The subtract instruction would incur a 1 cycle stall due to the issue latency of the add instruction, as
its shifter operand is a shift by a register. The issue latency can be avoided by changing the code as
follows, replacing the register shift amount with the equivalent immediate:

    mov    r3, #10
    mul    r4, r2, r3
    add    r5, r6, r2, LSL #10
    sub    r7, r8, r2
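
The two scheduling rules described in this section can be checked mechanically. The following Python sketch is a simplified model, not a description of the actual pipeline: it assumes in-order single issue, treats every instruction (including mul) as a data processing instruction with a 1 cycle result latency, and applies only the two penalties stated above. The names Insn and stalls are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Insn:
    """One instruction in the simplified model (hypothetical helper type)."""
    text: str                      # assembly text, used for display only
    dest: str                      # destination register
    srcs: Tuple[str, ...]          # every register the instruction reads
    shift: str = "none"            # "none", "imm", "reg" or "rrx"
    shifted: Optional[str] = None  # register that passes through the shifter


def stalls(seq):
    """Return the stall cycles charged to each instruction in seq.

    Assumed rules, taken from the text above:
      * result latency is 1 cycle, or 2 cycles when the consumer shifts
        the result by an immediate;
      * a shift by register or RRX gives the instruction a 2 cycle issue
        latency and a 2 cycle result latency.
    """
    issue = []        # cycle in which each instruction issues
    last_writer = {}  # register name -> index of its most recent producer
    result = []
    for k, ins in enumerate(seq):
        if k == 0:
            back_to_back = 0
            earliest = 0
        else:
            prev = seq[k - 1]
            back_to_back = issue[k - 1] + 1
            issue_latency = 2 if prev.shift in ("reg", "rrx") else 1
            earliest = issue[k - 1] + issue_latency
        for reg in ins.srcs:
            p = last_writer.get(reg)
            if p is None:
                continue  # value was produced before this sequence
            slow_producer = seq[p].shift in ("reg", "rrx")
            slow_consumer = ins.shift == "imm" and ins.shifted == reg
            latency = 2 if (slow_producer or slow_consumer) else 1
            earliest = max(earliest, issue[p] + latency)
        issue.append(earliest)
        result.append(earliest - back_to_back)
        last_writer[ins.dest] = k
    return result


SUB = Insn("sub r6, r7, r8", "r6", ("r7", "r8"))
ADD = Insn("add r1, r2, r3", "r1", ("r2", "r3"))
MOV_SHIFT = Insn("mov r4, r1, LSL #2", "r4", ("r1",), "imm", "r1")

MOV_IMM = Insn("mov r3, #10", "r3", ())
MUL = Insn("mul r4, r2, r3", "r4", ("r2", "r3"))
ADD_REG = Insn("add r5, r6, r2, LSL r3", "r5", ("r6", "r2", "r3"), "reg", "r2")
ADD_IMM = Insn("add r5, r6, r2, LSL #10", "r5", ("r6", "r2"), "imm", "r2")
SUB2 = Insn("sub r7, r8, r2", "r7", ("r8", "r2"))

SEQUENCES = {
    "shift by immediate, original": [SUB, ADD, MOV_SHIFT],
    "shift by immediate, reordered": [ADD, SUB, MOV_SHIFT],
    "shift by register, original": [MOV_IMM, MUL, ADD_REG, SUB2],
    "shift by register, rewritten": [MOV_IMM, MUL, ADD_IMM, SUB2],
}

for name, seq in SEQUENCES.items():
    print(name)
    for ins, stall in zip(seq, stalls(seq)):
        print(f"    {ins.text:<26} stall={stall}")
```

Under these assumptions the model charges a 1 cycle stall to the mov in the original shift by immediate sequence and to the sub in the original shift by register sequence, and no stalls to either rearranged sequence, which agrees with the discussion above. It does not model multiplier latencies or memory instructions, so it should not be used to estimate real cycle counts.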
January, 2004
Developer's Manual
