Full Core Utilization - Intel Pentium II Developer's Manual

MICRO-ARCHITECTURE OVERVIEW
2.1. FULL CORE UTILIZATION

The approach of using three independent engines was taken to utilize the processor core more
fully. Consider the pseudo code fragment in Figure 2-2.
The first instruction in this example is a load of r1 that, at run time, causes a cache miss. A
traditional processor core must wait for its bus interface unit to read this data from main
memory and return it before moving on to instruction 2. The processor stalls while waiting
for the data and is thus underutilized.
To avoid this memory latency problem, a P6 family processor "looks ahead" into the
instruction pool at subsequent instructions and does useful work rather than stalling. In the
example in Figure 2-2, instruction 2 is not executable since it depends upon the result of
instruction 1; however, instructions 3 and 4 have no prior dependencies and are therefore
executable. The processor executes instructions 3 and 4 out of order. The results of this
out-of-order execution cannot be committed to permanent machine state (i.e., the
programmer-visible registers) immediately, since the original program order must be
maintained. The results are instead stored back in the instruction pool, awaiting in-order
retirement. The core executes instructions according to their readiness to execute, not their
original program order, and is therefore a true dataflow engine. This approach has the side
effect that instructions are typically executed out of order.
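
The difference between a stalling core and a dataflow core can be sketched with a small timing model of the fragment in Figure 2-2. This is only an illustration under stated assumptions, not a description of the actual P6 hardware: the latencies (10 cycles for the missing load, 1 cycle for each other instruction), the unlimited number of execution units, and all names (Instr, PROGRAM, producers, and so on) are hypothetical. Only true (read-after-write) dependencies are modeled.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str        # register written
    srcs: tuple      # registers read
    latency: int     # cycles from start to result (assumed values)


# The fragment from Figure 2-2. The load is assumed to miss the cache and
# take 10 cycles; every other instruction is assumed to take 1 cycle.
PROGRAM = [
    Instr("I1", "r1", ("r0",), 10),      # r1 <= mem[r0]
    Instr("I2", "r2", ("r1", "r2"), 1),  # r2 <= r1 + r2
    Instr("I3", "r5", ("r5",), 1),       # r5 <= r5 + 1
    Instr("I4", "r6", ("r6", "r3"), 1),  # r6 <= r6 - r3
]


def producers(program):
    """For each instruction, the indices of earlier instructions whose
    results it reads (true dependencies only)."""
    last_writer = {}
    deps = []
    for i, ins in enumerate(program):
        deps.append({last_writer[s] for s in ins.srcs if s in last_writer})
        last_writer[ins.dest] = i
    return deps


def serialized_finish_times(program):
    """Stalling model: each instruction starts only after the previous
    one has finished."""
    t = 0
    finish = []
    for ins in program:
        t += ins.latency
        finish.append(t)
    return finish


def dataflow_finish_times(program):
    """Dataflow model: each instruction starts as soon as all of its
    producers have finished, regardless of program order."""
    deps = producers(program)
    finish = []
    for i, ins in enumerate(program):
        start = max((finish[d] for d in deps[i]), default=0)
        finish.append(start + ins.latency)
    return finish


def retire_times(finish):
    """In-order retirement: an instruction cannot retire before its
    predecessor in program order has retired."""
    t = 0
    retire = []
    for f in finish:
        t = max(t, f)
        retire.append(t)
    return retire
```

With these assumed latencies, the stalling model finishes the four instructions at cycles 10, 11, 12, and 13. The dataflow model finishes instructions 3 and 4 at cycle 1, while the load is still outstanding, yet retire_times shows that their results become permanent only at cycle 11, after instructions 1 and 2 have retired.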
The cache miss on instruction 1 will take many internal clocks, so the core continues to look
ahead for other instructions that could be speculatively executed, and is typically looking 20
to 30 instructions in front of the instruction pointer. Within this 20- to 30-instruction window
there will be, on average, five branches that the fetch/decode unit must correctly predict if
the dispatch/execute unit is to do useful work. The sparse register set of an Intel Architecture
(IA) processor will create many false dependencies on registers, so the dispatch/execute unit
will rename the Intel Architecture registers into a larger register set to enable additional
forward progress. The Retire Unit owns the programmer's Intel Architecture register set, and
results are committed to permanent machine state in these registers only when it removes
completed instructions from the pool in original program order.
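
To see how renaming removes false dependencies, consider a variant of the pseudo code in which a later instruction reuses r1 as its destination. The following is a minimal sketch of generic register renaming, not the mechanism the P6 family actually implements; the function rename, the physical register names p0, p1, ..., and the example program are all hypothetical.

```python
def rename(program, arch_regs):
    """Give every register write a fresh physical register.

    program   : list of (dest, srcs) pairs using architectural names
    arch_regs : architectural registers that hold values on entry
    Returns the renamed program and the final architectural-to-physical map.
    """
    # Initial mapping: one physical register per architectural register.
    mapping = {r: f"p{i}" for i, r in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dest, srcs in program:
        # Sources read whichever physical register currently holds the value.
        new_srcs = tuple(mapping[s] for s in srcs)
        # The destination always gets a new physical register, so a later
        # write to the same architectural name never overwrites a value
        # that an earlier instruction still needs.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], new_srcs))
    return renamed, mapping


# Hypothetical variant of Figure 2-2 in which instruction 3 reuses r1.
# Without renaming, instruction 3 could not write r1 until instruction 2
# had read it, even though no data flows between them.
PROGRAM = [
    ("r1", ("r0",)),        # r1 <= mem[r0]
    ("r2", ("r1", "r2")),   # r2 <= r1 + r2
    ("r1", ("r5",)),        # r1 <= r5 + 1   (reuses r1)
    ("r6", ("r1", "r3")),   # r6 <= r1 - r3
]

RENAMED, FINAL_MAP = rename(PROGRAM, ["r0", "r1", "r2", "r3", "r5", "r6"])
```

After renaming, the two writes to r1 target different physical registers (p6 and p8 in this sketch), so instructions 3 and 4 depend only on each other and can proceed while the load in instruction 1 is still outstanding. The final mapping records which physical register holds each architectural value once all four instructions have retired.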
Dynamic Execution technology can be summarized as optimally adjusting instruction
execution by predicting program flow, having the ability to speculatively execute instructions
in any order, and then analyzing the program's dataflow graph to choose the best order to
execute the instructions.

r1 <= mem [r0]    /* Instruction 1 */
r2 <= r1 + r2     /* Instruction 2 */
r5 <= r5 + 1      /* Instruction 3 */
r6 <= r6 - r3     /* Instruction 4 */

Figure 2-2. A Typical Pseudo Code Fragment