Performance Considerations; Branch Prediction - Intel PXA255 User Manual

Performance Considerations

This chapter describes performance considerations that compiler writers, application programmers,
and system designers need to be aware of to use the Intel® XScale™ core efficiently. The performance
numbers discussed here cover branch prediction penalties and instruction latencies.
The timings in this section are specific to the PXA255 processor, and how it implements the ARM*
v5TE architecture. This is not a summary of all possible optimizations nor is it an explanation of
the ARM* v5TE instruction set. For information on instruction definitions and behavior consult the
ARM* Architecture Reference Manual.
11.1 Branch Prediction

The Intel® XScale™ core implements dynamic branch prediction for the ARM* instructions B and
BL and for the Thumb instruction B. Any instruction that specifies the PC as the destination is
predicted as not taken and is not entered into the branch target buffer (BTB). For example, an LDR
or a MOV that loads or moves directly to the PC is predicted not taken and incurs a branch latency penalty.
The instructions B and BL (including Thumb) are entered into the BTB the first time they are
taken. A branch is taken when its condition evaluates to true. Once in the
branch target buffer, the Intel® XScale™ core dynamically predicts the outcome of these
instructions based on previous outcomes.
Table 11-1 shows the branch latency penalty when these instructions are correctly predicted and
when they are not. A penalty of zero for correct prediction means that the Intel® XScale™ core can
execute the next instruction in the program flow in the cycle following the branch.

Table 11-1. Branch Latency Penalty

Core Clock Cycles
ARM*    Thumb   Description
+0      +0      Predicted Correctly. The instruction matches in the branch target buffer and is
                correctly predicted.
+4      +5      Mispredicted. There are three occurrences of branch misprediction, all of
                which incur a 4-cycle branch delay penalty.
                1. The instruction is in the branch target buffer and is predicted not-taken, but
                   is actually taken.
                2. The instruction is not in the branch target buffer and is a taken branch.
                3. The instruction is in the branch target buffer and is predicted taken, but is
                   actually not-taken.
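
The penalty rules in Table 11-1 can be written as a small lookup. The following C sketch is illustrative only and is not taken from the manual: the names isa_mode and branch_penalty are hypothetical, and the sketch assumes that a not-taken branch that misses in the branch target buffer costs no penalty, since the table does not list that case as a misprediction.

```c
#include <stdbool.h>

/* Instruction set state of the branch (hypothetical type for this sketch). */
typedef enum { MODE_ARM, MODE_THUMB } isa_mode;

/*
 * Returns the branch latency penalty in core clock cycles, following
 * Table 11-1. 'predicted_taken' is only meaningful when 'in_btb' is true.
 */
int branch_penalty(isa_mode mode, bool in_btb, bool predicted_taken, bool taken)
{
    /* Assumption: a branch that is not in the BTB behaves as if it were
     * predicted not-taken, so it is penalized only when actually taken. */
    bool prediction = in_btb ? predicted_taken : false;

    if (prediction == taken)
        return 0;                        /* correct prediction: +0 */

    /* Misprediction cases 1-3: +4 in the ARM column, +5 in the Thumb column. */
    return (mode == MODE_THUMB) ? 5 : 4;
}
```

For example, branch_penalty(MODE_THUMB, false, false, true) models case 2 (a taken branch that is not yet in the branch target buffer) in Thumb state and returns 5.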