Note that the potential effects of µop reordering are not factored into the
estimations discussed.
Examine Example E-1, which uses the prefetchnta instruction with a
prefetch scheduling distance of 3, that is, psd = 3. The data prefetched in
iteration i will actually be used in iteration i+3. Tc represents the cycles
needed to execute top_loop, assuming all the memory accesses hit L1,
while il (iteration latency) represents the cycles needed to execute this
loop with the actual run-time memory footprint. Tc can be determined by
computing the critical path latency of the code dependency graph. This
work is quite arduous without help from special performance
characterization tools or compilers. A simple heuristic for estimating the
Tc value is to count the number of instructions in the critical path and
multiply that number by an artificial CPI. A reasonable CPI value
would be somewhere between 1.0 and 1.5, depending on the quality of
code scheduling.
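The CPI heuristic above can be sketched in C as a one-line estimate. This is only an illustration: the helper name estimate_tc and the instruction count used in the usage note are hypothetical, not values taken from Example E-1.

```c
#include <assert.h>

/* Hypothetical helper: estimate the per-iteration computation latency
 * (in cycles, all memory accesses assumed to hit L1) as the number of
 * instructions on the critical path times an assumed, artificial CPI. */
static double estimate_tc(unsigned critical_path_insts, double cpi)
{
    assert(cpi > 0.0);  /* a CPI of zero or less is not meaningful */
    return (double)critical_path_insts * cpi;
}
```

For an assumed critical path of 20 instructions, estimate_tc(20, 1.0) and estimate_tc(20, 1.5) bracket the estimate between 20 and 30 cycles, which corresponds to the suggested CPI range of 1.0 to 1.5.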
Example E-1 Calculating Insertion for Scheduling Distance of 3
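top_loop:
    prefetchnta [edx+esi+32*3]     ; prefetch data for iteration i+3 (psd = 3)
    prefetchnta [edx*4+esi+32*3]   ; second stream, same distance
    . . . . .
    movaps xmm1, [edx+esi]         ; loads for the current iteration
    movaps xmm2, [edx*4+esi]
    movaps xmm3, [edx+esi+16]
    movaps xmm4, [edx*4+esi+16]
    . . . . .
    . . .
    add esi, 32                    ; advance 32 bytes per iteration
    cmp esi, ecx
    jl top_loop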
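A rough C counterpart of this loop may make the insertion distance easier to see. It is a sketch under stated assumptions, not a translation of the elided loop body: the function name dot_with_prefetch and the multiply-accumulate work are hypothetical, and __builtin_prefetch is a GCC/Clang builtin that is only a hint (with locality 0 it may be lowered to prefetchnta on x86 targets).

```c
#include <assert.h>
#include <stddef.h>

enum {
    FLOATS_PER_ITER = 8, /* 8 floats = 32 bytes per stream, as in add esi, 32 */
    PSD = 3              /* prefetch scheduling distance, in iterations */
};

/* Hypothetical example: dot product over two streams, 32 bytes of each
 * per iteration. n is assumed to be a multiple of FLOATS_PER_ITER;
 * any leftover tail elements are ignored in this sketch. */
static float dot_with_prefetch(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i + FLOATS_PER_ITER <= n; i += FLOATS_PER_ITER) {
        /* Index of the data that iteration i + PSD will consume. */
        size_t ahead = i + (size_t)PSD * FLOATS_PER_ITER;
        if (ahead < n) {
            /* rw = 0 (read), locality = 0 (no temporal locality). */
            __builtin_prefetch(&a[ahead], 0, 0);
            __builtin_prefetch(&b[ahead], 0, 0);
        }
        for (size_t k = 0; k < FLOATS_PER_ITER; ++k)
            sum += a[i + k] * b[i + k];
    }
    return sum;
}
```

The bounds check on ahead keeps the sketch from forming addresses past the end of the arrays; the assembly example omits such a check.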
Mathematics of Prefetch Scheduling Distance
Tb: data transfer latency, which is equal to number of lines per
iteration * line burst latency
Tc: the cycles needed to execute top_loop, assuming all the memory
accesses hit L1
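The data transfer latency (number of lines per iteration * line burst latency) can be worked through with a small C sketch. Both helper names are hypothetical, and the cache line size and line burst latency used in the usage note are assumed values chosen only to make the arithmetic concrete.

```c
#include <assert.h>

/* Hypothetical helper: average number of cache lines touched per
 * iteration, given the bytes consumed per stream, the number of
 * streams, and an assumed cache line size in bytes. */
static double lines_per_iteration(unsigned bytes_per_stream,
                                  unsigned n_streams,
                                  unsigned line_bytes)
{
    assert(line_bytes > 0);
    return (double)n_streams * (double)bytes_per_stream / (double)line_bytes;
}

/* Data transfer latency in cycles:
 * number of lines per iteration * line burst latency. */
static double transfer_latency(double lines_per_iter, double line_burst_cycles)
{
    return lines_per_iter * line_burst_cycles;
}
```

With two streams of 32 bytes each, an assumed 32-byte cache line, and an assumed line burst latency of 8 cycles, lines_per_iteration(32, 2, 32) gives 2 lines, and transfer_latency(2.0, 8.0) gives 16 cycles.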