Software Pipelining and Loop Support
5.1 Overview
The Itanium architecture provides extensive support for software-pipelined loops,
including register rotation, special loop branches, and application registers. When
combined with predication and support for speculation, these features help to reduce
code expansion, path length, and branch mispredictions for loops that can be software
pipelined.
The beginning of this chapter reviews basic loop terminology and instructions, and
describes the problems that arise when optimizing loops in the absence of architectural
support. Specific loop support features of the Itanium architecture are then introduced.
The remainder of this chapter describes the programming and optimization of various
types of loops.
5.2 Loop Terminology and Basic Loop Support
Loops can be categorized into two types: counted and while. In counted loops, the loop
condition is based on the value of a loop counter and the trip count can be computed
prior to starting the loop. In while loops, the loop condition is a more general
calculation (not a simple count) and the trip count is unknown. Both types are directly
supported in the architecture.
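The distinction between the two loop types can be illustrated with a small sketch. It is written in Python rather than Itanium assembly, and the function names (counted_sum, list_length) and the dictionary-based linked list are hypothetical choices made only for this illustration. In the first function the trip count is known before the loop starts; in the second, the number of iterations depends on the data being traversed.

```python
def counted_sum(values):
    """Counted loop: the trip count is computed before the loop begins."""
    trip_count = len(values)          # known up front
    total = 0
    for i in range(trip_count):       # loop condition depends only on the counter
        total += values[i]
    return total, trip_count


def list_length(head, next_of):
    """While loop: the exit condition is a general, data-dependent test."""
    count = 0
    node = head
    while node is not None:           # trip count is not known in advance
        count += 1
        node = next_of[node]
    return count
```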
The Itanium architecture improves the performance of conventional counted loops by
providing a special counted loop branch (the br.cloop instruction) and the Loop Count
application register (LC).
The br.cloop instruction does not have a branch predicate. Instead, the branching
decision is based on the value of the LC register. If the LC register is greater than
zero, it is decremented and the br.cloop branch is taken.
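The branching rule just described can be modeled in a few lines. The following is a minimal Python sketch, not Itanium code. The helper names (br_cloop, run_counted_loop) are hypothetical, and two points are assumptions rather than statements from the text above: that LC is left unchanged when the branch falls through, and that LC is preloaded with the trip count minus one. The second assumption follows from the rule itself, since the loop body runs once before the first branch decision.

```python
def br_cloop(lc):
    """Model the br.cloop decision: returns (branch_taken, new_lc)."""
    if lc > 0:
        return True, lc - 1           # LC > 0: decrement LC and take the branch
    return False, lc                  # assumption: LC unchanged on fall-through


def run_counted_loop(trip_count, body):
    """Run body(i) trip_count times using the br.cloop model (trip_count >= 1)."""
    if trip_count < 1:
        raise ValueError("the loop body always executes at least once")
    lc = trip_count - 1               # assumption: LC is preloaded with N - 1
    iterations = 0
    while True:
        body(iterations)              # loop body executes before the branch
        iterations += 1
        taken, lc = br_cloop(lc)
        if not taken:
            break
    return iterations, lc
```

Under this model a loop that should run four times starts with LC equal to 3, and the branch is taken three times before falling through.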
5.3 Optimization of Loops
In many loops, there are not enough independent instructions within a single iteration
to hide execution latency and make full use of the functional units. For example, in the
loop body below, there is very little ILP:
L1:     ld4       r4 = [r5],4 ;;    // Cycle 0 load postinc 4
        add       r7 = r4,r9 ;;     // Cycle 2
        st4       [r6] = r7,4       // Cycle 3 store postinc 4
        br.cloop  L1 ;;             // Cycle 3 (br.cloop has no branch predicate)
In this code, all the instructions from iteration X are executed before iteration X+1 is
started. Assuming that the store from iteration X and the load from iteration X+1 are
independent memory references, utilization of the functional units could be improved
by moving independent instructions from iteration X+1 to iteration X, effectively
overlapping iteration X with iteration X+1.
Volume 1, Part 2: Software Pipelining and Loop Support
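To make the cost of the non-overlapped schedule concrete, the loop at L1 can be modeled as follows. This is a rough Python sketch under stated assumptions, not a cycle-accurate simulation. It reads the cycle comments as four cycles per iteration (cycles 0 through 3), treats the loaded values as unsigned 32-bit integers, and ignores branch and memory effects. The names run_loop and overlapped_cycles, and the initiation_interval parameter, are hypothetical and serve only to show how overlapping iterations would shorten the total.

```python
CYCLES_PER_ITERATION = 4              # cycles 0..3, read from the listing's comments


def run_loop(src, r9):
    """Functional model of the L1 loop: dst[i] = src[i] + r9, one iteration at a time."""
    dst = []
    cycles = 0
    for value in src:                 # one pass per loop iteration
        r4 = value                    # ld4 r4 = [r5],4
        r7 = r4 + r9                  # add r7 = r4,r9
        dst.append(r7 & 0xFFFFFFFF)   # st4 [r6] = r7,4 (stores the low 4 bytes)
        cycles += CYCLES_PER_ITERATION
    return dst, cycles


def overlapped_cycles(trip_count, initiation_interval):
    """Total cycles if a new iteration could start every initiation_interval cycles.

    Hypothetical: whether a given interval is achievable depends on
    dependences and functional-unit resources not modeled here.
    """
    if trip_count == 0:
        return 0
    return (trip_count - 1) * initiation_interval + CYCLES_PER_ITERATION
```

With an interval of 4 the formula reproduces the sequential total from run_loop; smaller intervals show the saving that overlapping iteration X with iteration X+1 would offer.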