Branch Hints; Execution Core Detail; Instruction Latency And Throughput - Intel NetBurst User Manual

Intel pentium 4 processor user manual

A Detailed Look Inside the Intel NetBurst™ Micro-Architecture of the Intel Pentium 4 Processor

The Static Predictor. Once the branch instruction is decoded, the direction of the branch (forward or backward) is known. If there is no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement), such as loop-closing branches, as taken. Forward branches are predicted not taken.
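
The static rule described above can be written as a one-line predicate. This is only an illustrative sketch: the function name and the use of a signed 32-bit displacement are assumptions made for the example, not part of the hardware description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static prediction rule as described in the text: when the BTB has no
 * valid entry, a conditional branch with a negative displacement
 * (a backward branch) is predicted taken, and a forward branch is
 * predicted not taken.
 * `displacement` is the signed branch offset (hypothetical parameter). */
bool static_predict_taken(int32_t displacement)
{
    return displacement < 0;
}
```

For instance, a loop-closing branch with displacement -16 would be predicted taken, while a branch that skips forward 32 bytes would be predicted not taken.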

Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome before the branch instruction is even decoded, based on a history of previously encountered branches. It uses a branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of branches based on an instruction's linear address. Once the branch is retired, the BTB is updated with the target address.
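
To make the lookup-and-update cycle concrete, here is a deliberately simplified software model. Everything about its organization is an assumption for illustration: the table size, the direct-mapped indexing, the single "last direction" bit standing in for branch history, and all type and function names. The real BTB's structure is not described in the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16  /* hypothetical size, not the real capacity */

typedef struct {
    bool     valid;
    uint32_t branch_addr;  /* linear address of the branch (acts as a tag) */
    uint32_t target;       /* last recorded target address */
    bool     taken;        /* last recorded direction (simplified history) */
} btb_entry;

typedef struct {
    btb_entry entries[BTB_ENTRIES];
} btb;

/* Look up a branch by linear address. Returns true on a hit and fills in
 * the predicted direction and target; returns false on a miss, in which
 * case the caller would fall back to static prediction. */
bool btb_lookup(const btb *b, uint32_t branch_addr, bool *taken, uint32_t *target)
{
    const btb_entry *e = &b->entries[branch_addr % BTB_ENTRIES];
    if (!e->valid || e->branch_addr != branch_addr)
        return false;
    *taken = e->taken;
    *target = e->target;
    return true;
}

/* Called when a branch retires: record its actual outcome. */
void btb_update_on_retire(btb *b, uint32_t branch_addr, bool taken, uint32_t target)
{
    btb_entry *e = &b->entries[branch_addr % BTB_ENTRIES];
    e->valid = true;
    e->branch_addr = branch_addr;
    e->taken = taken;
    e->target = target;
}
```

In this model, the first encounter of a branch misses in the table, and only after the branch has retired once does a later lookup at the same linear address return a prediction.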

Return Stack. Returns are always taken, but since a procedure may be invoked from several call sites, a single predicted target will not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the need to put certain procedures inline, since the return penalty portion of the procedure call overhead is reduced.
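
The idea of predicting returns with a stack can be sketched in a few lines. The depth, the wrap-around behavior when the stack overflows, and all names below are assumptions chosen for the example; the text does not specify how the hardware Return Stack is sized or managed.

```c
#include <stdbool.h>
#include <stdint.h>

#define RS_DEPTH 4  /* hypothetical depth, chosen only for illustration */

typedef struct {
    uint32_t addr[RS_DEPTH];
    unsigned top;    /* index of the next free slot, modulo RS_DEPTH */
    unsigned count;  /* number of valid entries, at most RS_DEPTH */
} return_stack;

/* On a call: push the address of the instruction following the call. */
void rs_push(return_stack *rs, uint32_t return_addr)
{
    rs->addr[rs->top] = return_addr;
    rs->top = (rs->top + 1) % RS_DEPTH;
    if (rs->count < RS_DEPTH)
        rs->count++;  /* when full, the oldest entry is overwritten */
}

/* On a return: pop the most recent entry as the predicted target.
 * Returns false if the stack holds no valid prediction. */
bool rs_pop(return_stack *rs, uint32_t *predicted)
{
    if (rs->count == 0)
        return false;
    rs->top = (rs->top + RS_DEPTH - 1) % RS_DEPTH;
    *predicted = rs->addr[rs->top];
    rs->count--;
    return true;
}
```

Because each call pushes its own return address, the same procedure called from two different sites yields two different predicted targets, which a single BTB entry for the return instruction could not provide.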

Even if the direction and target address of a branch are correctly predicted well in advance, a taken branch may reduce available parallelism in a typical processor: decode bandwidth is wasted on the instructions that immediately follow the branch and on those that precede the target, unless the branch ends its line and the target begins its line. In the Pentium 4 processor, the branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing instruction delivery from the front end.

Branch Hints

The Pentium 4 processor permits software to provide hints to the branch prediction and trace formation hardware to enhance their performance. These hints take the form of prefixes to conditional branch instructions. The prefixes have no effect on processor implementations that predate the Pentium 4 processor. Branch hints are not guaranteed to have any effect, and their function may vary across implementations. However, since branch hints are architecturally visible, and the same code could be run on multiple implementations, they should be inserted only in cases that are likely to be helpful across all implementations.

Branch hints are interpreted by the translation engine and are used to assist the branch prediction and trace construction hardware. They are used only at trace build time and have no effect within already-built traces. Directional hints override the static (backward-taken, forward-not-taken) prediction when a BTB prediction is not available. Because branch hints increase code size slightly, the preferred way to provide directional hints is to arrange the code so that

(i) for forward branches, the more probable path is the not-taken (fall-through) path, and

(ii) for backward branches, the more probable path is the taken path.

Since the branch prediction information that is available when the trace is built is used to predict which path or trace through the code will be taken, directional branch hints can help traces be built along the most likely path.
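
In C source, one common way to ask for this kind of layout is the `__builtin_expect` builtin provided by GCC and Clang. The sketch below is an assumption-laden illustration rather than a recommendation from the manual: the `UNLIKELY` macro and the `sum_nonnegative` function are hypothetical names, and whether the compiler actually places the rare path out of line depends on the compiler and its optimization settings.

```c
#include <stddef.h>

/* GCC/Clang builtin; other compilers may need a different mechanism. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sum the values; return -1 as soon as a negative value is seen.
 * The error check is marked unlikely so the compiler can keep the common
 * path as the fall-through of a forward branch. The loop-closing branch
 * is a backward branch that is taken on every iteration but the last. */
long sum_nonnegative(const int *values, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(values[i] < 0))
            return -1;       /* rare path */
        total += values[i];  /* common path */
    }
    return total;
}
```

Arranging code this way costs no extra bytes in the instruction stream, which is why the text prefers it to hint prefixes.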

Execution Core Detail

The execution core is designed to optimize overall performance by handling the most common cases most efficiently. The hardware is designed to execute the most frequent operations in the most common contexts as fast as possible, at the expense of less frequent operations in rare contexts. Some parts of the core may speculate that a common condition holds in order to allow faster execution. If the condition does not hold, the machine may stall. One example pertains to store forwarding: if a load is predicted to depend on a store, it gets its data from that store and tentatively proceeds. If the load turns out not to depend on the store, it is delayed until the real data has been loaded from memory, and then it proceeds.

Instruction Latency and Throughput

The superscalar, out-of-order core contains multiple execution hardware resources that can execute multiple µops in parallel. The core's ability to make use of available parallelism can be enhanced by:
