Intel i860 Manual, page 177

PROGRAMMING EXAMPLES
9.12 CACHE STRATEGIES FOR MATRIX DOT PRODUCT
Calculations that use (and reuse) massive amounts of data may deliver significantly less
than optimum performance unless their memory access demands are carefully taken into
consideration during algorithm design. The prior Example 9-12 easily executes at near
the theoretical maximum speed of the i860 microprocessor because it does not make
heavy demands on the memory subsystem. This section considers a more demanding
calculation, the dot product of two matrices, and analyzes two memory access strategies
as they apply to this calculation.
The product of matrix A = (Ai,j) of dimension L x M with matrix B = (Bi,j) of
dimension M x N is the matrix C = (Ci,j) of dimension L x N, where each entry is the
dot product of a row of A with a column of B:

    Ci,j = SUM(k = 1 to M) Ai,k * Bk,j
The basic algorithm for calculation of a dot product appears in Example 9-10. To extend
this algorithm to the current problem requires adding instructions to:
1. Load the entries of each matrix from memory at appropriate times.
2. Repeat the inner loop as many times as necessary to span matrices of arbitrary
   M dimension.
3. Repeat the entire algorithm L * N times to produce the L x N product matrix.
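The manual's examples are written in i860 assembly language (not reproduced here). As a structural sketch only, the three extensions above correspond to the following C loop nest; the function name, row-major layout, and single-precision element type are illustrative assumptions, not details taken from the manual:

```c
#include <stddef.h>

/* Structural sketch of the extended dot-product algorithm:
 * step 3 is the outer i/j loop pair (executed L * N times),
 * step 2 is the inner loop spanning the M dimension, and
 * step 1 is the loading of A and B entries inside that loop.
 * Row-major layout for both matrices is assumed here; the manual's
 * examples instead store B by column, as described below. */
static void matmul(const float *a, const float *b, float *c,
                   size_t l, size_t m, size_t n)
{
    for (size_t i = 0; i < l; i++) {              /* step 3: L times ... */
        for (size_t j = 0; j < n; j++) {          /* ... times N         */
            float sum = 0.0f;
            for (size_t k = 0; k < m; k++)        /* step 2: span M      */
                sum += a[i * m + k] * b[k * n + j];  /* step 1: loads    */
            c[i * n + j] = sum;
        }
    }
}
```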
Each of the Examples 9-13 and 9-14 accomplishes the above extensions through
straightforward programming techniques. Each example uses dual-instruction mode to
perform the loading and loop-control operations in parallel with the basic floating-point
calculations. The examples differ in their approaches to memory access and cache usage.
To eliminate needless complexity, the examples require that the M dimension be a
multiple of eight and that the B matrix be stored in memory by column instead of by
row. Data is fetched 32 bytes beyond the higher-address end of both matrices. In real
applications, programmers should ensure that no page protection faults occur due to
these accesses.
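Storing B by column means that a column of B, like a row of A, occupies consecutive memory addresses, so the inner loop walks both operands at unit stride. A small C sketch of the resulting address arithmetic; the 4-byte single-precision element size is an assumption here, not a detail stated on this page:

```c
#include <stddef.h>

/* With A stored by row and B stored by column, consecutive values of
 * the inner-loop index k touch consecutive addresses in both matrices:
 *   A(i,k) lives at byte offset (i*M + k) * sizeof(float)
 *   B(k,j) lives at byte offset (j*M + k) * sizeof(float)
 * Four consecutive k values therefore occupy one 16-byte block in each
 * matrix -- the unit loaded by a single fld.q instruction. */
static size_t a_offset(size_t i, size_t k, size_t m)
{
    return (i * m + k) * sizeof(float);   /* row-major A */
}

static size_t b_offset(size_t k, size_t j, size_t m)
{
    return (j * m + k) * sizeof(float);   /* column-major B */
}
```

Requiring M to be a multiple of eight ensures the inner loop always operates on whole quad-word groups of both operands, with no remainder iterations to handle separately.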
• Example 9-13 depends solely on cached loads.
• Example 9-14 depends on a mix of cached and pipelined loads.
Example 9-13 uses the fld instruction for all loads, which places all elements of both
matrices A and B in the cache. This approach is ideal for small matrices. Accesses to all
elements (after the first access to each) retrieve elements from the cache at the rate of
one per clock. Using fld.q instructions to retrieve four elements at a time, it is possible
to overlap all data access as well as loop control with m12apm instructions in the inner
loop.
Note, however, that Example 9-13 is "cache bound"; i.e., if the combined size of the two
matrices is greater than that of the cache, cache misses occur, degrading performance.
The larger the matrices, the more misses occur.
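To see roughly where the cache-bound limit falls, the combined footprint of A and B can be compared against the data cache size. The 8 KB figure below is the i860 XR's data cache size, and 4-byte single-precision elements are assumed; both are assumptions to adjust for the actual part and precision in use:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when both operand matrices fit in the data cache at
 * once, i.e. when Example 9-13's all-cached load strategy can avoid
 * capacity misses.  Assumptions: 8 KB data cache (i860 XR) and
 * 4-byte single-precision matrix elements. */
static bool fits_in_cache(size_t l, size_t m, size_t n)
{
    const size_t cache_bytes = 8 * 1024;
    size_t footprint = (l * m + m * n) * sizeof(float);
    return footprint <= cache_bytes;
}
```

Under these assumptions, two 32 x 32 single-precision matrices together occupy exactly 8 KB, so square operands much beyond that size push Example 9-13 into the cache-bound regime.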