Intel i860 Manual, page 177

PROGRAMMING EXAMPLES
9.12 CACHE STRATEGIES FOR MATRIX DOT PRODUCT
Calculations that use (and reuse) massive amounts of data may deliver significantly less
than optimum performance unless their memory access demands are carefully taken into
consideration during algorithm design. The prior Example 9-12 easily executes at near
the theoretical maximum speed of the i860 microprocessor because it does not make
heavy demands on the memory subsystem. This section considers a more demanding
calculation, the dot product of two matrices, and analyzes two memory access strategies
as they apply to this calculation.
The product of matrix A = (Ai,j) of dimension L x M with matrix B = (Bi,j) of
dimension M x N is the matrix C = (Ci,j) of dimension L x N, where each entry is the
dot product of a row of A with a column of B:

    Ci,j = SUM(k = 1 to M) Ai,k * Bk,j
The basic algorithm for calculation of a dot product appears in Example 9-10. To extend
this algorithm to the current problem requires adding instructions to:
1. Load the entries of each matrix from memory at appropriate times.
2. Repeat the inner loop as many times as necessary to span matrices of arbitrary
   M dimension.
3. Repeat the entire algorithm L * N times to produce the L x N product matrix.
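The manual's examples are written in i860 assembly language (not reproduced here). As a structural sketch only, the three extensions above correspond to the following C loop nest; the function name, row-major layout, and single-precision element type are illustrative assumptions, not details taken from the manual:

```c
#include <stddef.h>

/* Structural sketch of the extended dot-product algorithm:
 * step 3 is the outer i/j loop pair (executed L * N times),
 * step 2 is the inner loop spanning the M dimension, and
 * step 1 is the loading of A and B entries inside that loop.
 * Row-major layout for both matrices is assumed here; the manual's
 * examples instead store B by column, as described below. */
static void matmul(const float *a, const float *b, float *c,
                   size_t l, size_t m, size_t n)
{
    for (size_t i = 0; i < l; i++) {              /* step 3: L times ... */
        for (size_t j = 0; j < n; j++) {          /* ... times N         */
            float sum = 0.0f;
            for (size_t k = 0; k < m; k++)        /* step 2: span M      */
                sum += a[i * m + k] * b[k * n + j];  /* step 1: loads    */
            c[i * n + j] = sum;
        }
    }
}
```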
Each of the Examples 9-13 and 9-14 accomplishes the above extensions through
straightforward programming techniques. Each example uses dual-instruction mode to
perform the loading and loop-control operations in parallel with the basic floating-point
calculations. The examples differ in their approaches to memory access and cache usage.
To eliminate needless complexity, the examples require that the M dimension be a
multiple of eight and that the B matrix be stored in memory by column instead of by
row. Data is fetched 32 bytes beyond the higher-address end of both matrices. In real
applications, programmers should ensure that no page protection faults occur due to
these accesses.
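Storing B by column means that a column of B, like a row of A, occupies consecutive memory addresses, so the inner loop walks both operands at unit stride. A small C sketch of the resulting address arithmetic; the 4-byte single-precision element size is an assumption here, not a detail stated on this page:

```c
#include <stddef.h>

/* With A stored by row and B stored by column, consecutive values of
 * the inner-loop index k touch consecutive addresses in both matrices:
 *   A(i,k) lives at byte offset (i*M + k) * sizeof(float)
 *   B(k,j) lives at byte offset (j*M + k) * sizeof(float)
 * Four consecutive k values therefore occupy one 16-byte block in each
 * matrix -- the unit loaded by a single fld.q instruction. */
static size_t a_offset(size_t i, size_t k, size_t m)
{
    return (i * m + k) * sizeof(float);   /* row-major A */
}

static size_t b_offset(size_t k, size_t j, size_t m)
{
    return (j * m + k) * sizeof(float);   /* column-major B */
}
```

Requiring M to be a multiple of eight ensures the inner loop always operates on whole quad-word groups of both operands, with no remainder iterations to handle separately.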
• Example 9-13 depends solely on cached loads.
• Example 9-14 depends on a mix of cached and pipelined loads.
Example 9-13 uses the fld instruction for all loads, which places all elements of both
matrices A and B in the cache. This approach is ideal for small matrices. Accesses to all
elements (after the first access to each) retrieve elements from the cache at the rate of
one per clock. Using fld.q instructions to retrieve four elements at a time, it is possible
to overlap all data access as well as loop control with m12apm instructions in the inner
loop.
Note, however, that Example 9-13 is "cache bound"; i.e., if the combined size of the two
matrices is greater than that of the cache, cache misses occur, degrading performance.
The larger the matrices, the more misses occur.
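To see roughly where the cache-bound limit falls, the combined footprint of A and B can be compared against the data cache size. The 8 KB figure below is the i860 XR's data cache size, and 4-byte single-precision elements are assumed; both are assumptions to adjust for the actual part and precision in use:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when both operand matrices fit in the data cache at
 * once, i.e. when Example 9-13's all-cached load strategy can avoid
 * capacity misses.  Assumptions: 8 KB data cache (i860 XR) and
 * 4-byte single-precision matrix elements. */
static bool fits_in_cache(size_t l, size_t m, size_t n)
{
    const size_t cache_bytes = 8 * 1024;
    size_t footprint = (l * m + m * n) * sizeof(float);
    return footprint <= cache_bytes;
}
```

Under these assumptions, two 32 x 32 single-precision matrices together occupy exactly 8 KB, so square operands much beyond that size push Example 9-13 into the cache-bound regime.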