Arbitration Stages - IBM A2 User Manual


User's Manual
A2 Processor
As illustrated in Figure D-1, the front end of the pipeline consists of seven stages, IU0 through IU6. The
front end is responsible for fetching instructions, predicting branches, checking for register dependencies,
and arbitrating between threads for instruction issue. The back end of the pipeline consists of eight stages,
RF0 - RF1 and EX1 - EX6. The back end is responsible for executing instructions and interfacing to the L2.
The IU4, IU5, and IU6 stages are replicated for each thread. All other stages are shared in a fine-grain
manner. Instructions from different threads are interleaved on a cycle-by-cycle basis.
In the IU0 - IU4 pipeline stages, the next one to four instructions from one thread are fetched from the I-cache
and decoded. Branches are predicted in the IU3 and IU4 stages (see Section 2.8.4.6 Wait Instruction on
page 98 for more details about branch instructions and prediction). Up to eight instructions per thread are
buffered in IU4 in the instruction buffer (IBUFF). Instructions are not fetched unless there is room for them in
the instruction buffer. Hence, there are no stalls before IU4.
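The fetch gating described above can be sketched as a small model. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the rule that a fetch needs room for a full group of four is an assumption, since the text says only that instructions are not fetched unless there is room for them.

```python
from collections import deque

IBUFF_CAPACITY = 8  # "up to eight instructions per thread" (from the text)
MAX_FETCH = 4       # "one to four instructions" per fetch (from the text)


class InstructionBuffer:
    """Hypothetical per-thread model of the IBUFF; not the hardware design."""

    def __init__(self):
        self.entries = deque()

    def free_slots(self):
        return IBUFF_CAPACITY - len(self.entries)

    def can_fetch(self, group_size=MAX_FETCH):
        # Assumption: fetch is allowed only if the whole group fits.
        return self.free_slots() >= group_size

    def fill(self, instructions):
        # Fetch is gated on can_fetch(), so the buffer must never overflow.
        if len(instructions) > self.free_slots():
            raise ValueError("fetch issued without room in the buffer")
        self.entries.extend(instructions)

    def drain_oldest(self):
        # The single oldest instruction moves on to IU5.
        return self.entries.popleft() if self.entries else None


buf = InstructionBuffer()
buf.fill(["i0", "i1", "i2", "i3"])
room_after_first_fetch = buf.can_fetch()   # four slots remain
buf.fill(["i4", "i5", "i6", "i7"])
room_when_full = buf.can_fetch()           # buffer is full
oldest = buf.drain_oldest()
```

Because the fetch decision is made before instructions enter the pipeline, a full buffer suppresses the fetch rather than stalling a stage, which matches the statement that there are no stalls before IU4.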
The single oldest instruction is decoded and sent to the IU5 stage. Register dependency checking is
performed in IU5, and the instruction stalls here if input operands are not available. Instructions can stall in
IU5 for a variety of other typically infrequent reasons described in detail later. Because IU4 and IU5 are
replicated per thread, stalls at IU5 affect only that thread.
If the instruction is ready to issue, it is forwarded to the IU6 stage. The IU6 pipeline stage holds one ready
instruction from each thread. IU6 selects one of these for issue to the XU and the FU (if present) each cycle
whenever possible. Instructions can stall in IU6 for a variety of other typically infrequent reasons described in
detail later.
The last seven stages of the pipeline are unified for integer arithmetic and logic instructions, load and store
instructions, and branch instructions. Register file access and bypassing are performed in RF0 and RF1.
Branches and most simple ALU instructions produce their results in EX1. The data cache directory and the
D-ERAT are accessed in EX2. The data cache data array is accessed in EX4. Stores and loads that miss the
data cache are sent to the L2 in EX6.
The subsequent sections of this appendix provide additional details about the performance of various
instruction sequences, including the latencies of various instruction pairs.
D.1.1 Arbitration Stages
Arbitration between threads occurs at three points in the pipeline: IU0, IU6, and EX6.
The IU0 stage is responsible for selecting which of the four possible threads to fetch from. Each cycle, one
thread is selected in a round-robin fashion. Threads that are not able to fetch instructions for any reason are
passed over in the round-robin sequence.
The IU6 stage selects which thread will issue an instruction each cycle to both the FXU and the FU (if
present). This is also done in a fair round-robin fashion, and threads that do not have any instruction available
for issue are passed over in the round-robin sequence.
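The round-robin selection used at IU0 and IU6, including the passing over of threads that cannot proceed, can be sketched as follows. This is a minimal model, not the hardware implementation: the class name, the `ready` flags, and the choice to resume the search just after the most recently selected thread are all assumptions made for illustration.

```python
from typing import Optional, Sequence

NUM_THREADS = 4  # the text mentions four possible threads


class RoundRobinArbiter:
    """Hypothetical round-robin selector that skips threads that are not ready."""

    def __init__(self, num_threads: int = NUM_THREADS) -> None:
        self.num_threads = num_threads
        # Assumption: start so that thread 0 is the first candidate.
        self.last = num_threads - 1

    def select(self, ready: Sequence[bool]) -> Optional[int]:
        # Search from the thread after the last winner, wrapping around.
        for offset in range(1, self.num_threads + 1):
            candidate = (self.last + offset) % self.num_threads
            if ready[candidate]:
                self.last = candidate
                return candidate
        return None  # no thread can fetch or issue this cycle


arb = RoundRobinArbiter()
ready = [True, False, True, True]  # thread 1 is passed over
picks = [arb.select(ready) for _ in range(6)]
print(picks)  # [0, 2, 3, 0, 2, 3]
```

In this sketch a thread that is not ready does not consume its turn, so the remaining threads share the cycles evenly among themselves.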
The EX6 stage selects which command can be sent down to the L2 in each cycle. Commands can come from
stores, load data cache misses in the load miss queue, instruction cache misses, and TLB PTE loads (if the
TLB is present).
Instruction Execution Performance and Code Optimizations
Version 1.3
Page 834 of 864
October 23, 2012
