Texas Instruments TMS320C64X Programmer's Reference Manual page 102

DSP little-endian DSP library

DSP_mat_mul
Special Requirements
Implementation Notes
Benchmarks
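for (i = 0; i < r1; i++)          /* each row of x */
    for (j = 0; j < c2; j++)      /* each column of y */
    {
        sum = 0;
        /* dot product of row i of x with column j of y */
        for (k = 0; k < c1; k++)
            sum += x[k + i*c1] * y[j + k*c2];
        r[j + i*c2] = sum >> qs;  /* scale the result by qs bits */
    }
}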
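The loop nest above can be wrapped in a complete function so that it compiles and runs on a host machine. The following is a minimal sketch, not the library source: the name `mat_mul_ref`, the parameter order, and the element types (16-bit inputs and outputs, 32-bit accumulator) are assumptions, since this page shows only the loop body. The assertion reflects the dimension limits listed on this page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side reference for the loop nest shown above.
 * ASSUMED (not stated on this page): 16-bit elements, 32-bit accumulator,
 * and this parameter order. All matrices are row-major:
 * x is r1 x c1, y is c1 x c2, r is r1 x c2. */
void mat_mul_ref(const int16_t *x, int r1, int c1,
                 const int16_t *y, int c2,
                 int16_t *r, int qs)
{
    int i, j, k;

    /* Dimension limits from the requirements on this page. */
    assert(r1 >= 1 && c1 >= 1 && c2 >= 1);
    assert(r1 <= 32767 && c1 <= 32767 && c2 <= 32767);

    for (i = 0; i < r1; i++)
        for (j = 0; j < c2; j++)
        {
            int32_t sum = 0;
            for (k = 0; k < c1; k++)
                sum += x[k + i*c1] * y[j + k*c2];
            /* Right-shifting a negative value is implementation-defined
             * in ISO C; most compilers perform an arithmetic shift. */
            r[j + i*c2] = (int16_t)(sum >> qs);
        }
}
```

With qs = 0, multiplying the 2 x 2 matrix {1, 2, 3, 4} by {5, 6, 7, 8} gives {19, 22, 43, 50}.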
- The matrices x[], y[], and r[] must be stored in distinct arrays. That is,
  in-place processing is not allowed.
- Each input matrix must have at least 1 row and 1 column, and at most
  32767 rows and 32767 columns.
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: This code blocks interrupts during its innermost loop.
  Interrupts are not blocked otherwise. As a result, interrupts can be blocked
  for up to 0.25*c1' + 16 cycles at a time, where c1' is c1 rounded up to the
  next even number.
- The 'i' and 'k' loops are unrolled 2x. The 'j' loop is unrolled 4x. For
  dimensions that are not multiples of the corresponding unroll factor, this
  code calculates extra results beyond the edges of the matrix. These extra
  results are ultimately discarded. This allows the loops to be unrolled for
  efficient operation on large matrices without losing flexibility.
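The padding and the interrupt-blocking bound described in the notes above can be worked out with a few lines of C. This is a minimal sketch, not TI code: the helper names `round_up_to`, `extra_results`, and `max_blocked_cycles` are hypothetical, and `extra_results` assumes the routine evaluates the whole padded output tile, which the notes suggest but do not state outright.

```c
#include <assert.h>

/* Round n up to the next multiple of m. */
int round_up_to(int n, int m)
{
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

/* Number of output values computed and then discarded, ASSUMING the
 * full padded tile is evaluated: rows padded to a multiple of 2
 * ('i' loop), columns padded to a multiple of 4 ('j' loop). */
int extra_results(int r1, int c2)
{
    return round_up_to(r1, 2) * round_up_to(c2, 4) - r1 * c2;
}

/* Upper bound on cycles with interrupts blocked: 0.25*c1' + 16,
 * where c1' is c1 rounded up to the next even number. */
double max_blocked_cycles(int c1)
{
    return 0.25 * round_up_to(c1, 2) + 16.0;
}
```

For a 3 x 5 result, the padded tile is 4 x 8, so under this assumption 17 of the 32 computed values would be discarded. For c1 = 20, interrupts can be blocked for up to 21 cycles.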
Cycles
0.25 * (r1' * c2' * c1') + 2.25 * (r1' * c2') + 11, where:
    r1' = 2 * ceil(r1/2.0)   (r1 rounded up to next even number)
    c1' = 2 * ceil(c1/2.0)   (c1 rounded up to next even number)
    c2' = 4 * ceil(c2/4.0)   (c2 rounded up to next multiple of 4)
For r1 = 1, c1 = 1,  c2 = 1: 33 cycles
For r1 = 8, c1 = 20, c2 = 8: 475 cycles
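The cycle formula is easy to evaluate for other matrix sizes. The sketch below uses a hypothetical helper, `mat_mul_cycles`, which is not part of the library. It works in integers, which is exact here because r1' and c1' are even and c2' is a multiple of 4, so both products divide evenly by 4.

```c
#include <assert.h>

/* Estimated cycle count from the benchmark formula:
 *   0.25 * (r1' * c2' * c1') + 2.25 * (r1' * c2') + 11
 * long long is used because the triple product can exceed 32 bits
 * for dimensions near the 32767 limit. */
long long mat_mul_cycles(int r1, int c1, int c2)
{
    assert(r1 >= 1 && c1 >= 1 && c2 >= 1);

    long long r1p = 2LL * ((r1 + 1) / 2);  /* r1 rounded up to even */
    long long c1p = 2LL * ((c1 + 1) / 2);  /* c1 rounded up to even */
    long long c2p = 4LL * ((c2 + 3) / 4);  /* c2 rounded up to a multiple of 4 */

    /* 2.25 = 9/4, so the second term is written as 9 * (r1' * c2') / 4. */
    return (r1p * c2p * c1p) / 4 + (9 * r1p * c2p) / 4 + 11;
}
```

This reproduces the two figures quoted above: `mat_mul_cycles(1, 1, 1)` gives 33 and `mat_mul_cycles(8, 20, 8)` gives 475.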
Codesize
416 bytes
