Packed-Data Processing on the 'C64x
Figure 8–18. Graphical Representation of the _dotp2 Intrinsic c = _dotp2(b, a)
Although this code is fully vectorized, it can still be improved. The kernel
performs two LDDWs, two MPY2s, four ADDs, and one branch. Because of
the large number of ADDs, the loop cannot fit in a single cycle, so the 'C64x
datapath is not used efficiently.
The way to improve this is to combine some of the multiplies with some of the
adds. The 'C64x family of _dotp intrinsics does exactly that.
Figure 8–18 illustrates how the _dotp2 intrinsic operates. Other _dotp
intrinsics operate similarly.
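[Figure 8–18: a and b are 32-bit registers, each holding a 16-bit high half (a_hi, b_hi) and a 16-bit low half (a_lo, b_lo). The halves are multiplied pairwise into the 32-bit products a_hi * b_hi and a_lo * b_lo, which are then added to give the 32-bit result c = a_hi * b_hi + a_lo * b_lo.]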
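The arithmetic in Figure 8–18 can be written out in portable C. This is only a sketch for checking the math on a host machine: dotp2_model and pack2_model are hypothetical names, not TI intrinsics, and the sketch assumes both 16-bit halves are signed and that narrowing conversions wrap in the usual two's-complement way.

```c
#include <stdint.h>

/* Hypothetical helper: pack two signed 16-bit values into one 32-bit word,
 * with hi in bits 31..16 and lo in bits 15..0. */
static uint32_t pack2_model(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Hypothetical host-side model of c = _dotp2(b, a); not the intrinsic itself.
 * Each argument is treated as two packed signed 16-bit values. */
static int32_t dotp2_model(uint32_t b, uint32_t a)
{
    /* Unpack the halves. Converting values above 32767 to int16_t is
     * implementation-defined in C; common compilers wrap as expected. */
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);

    /* Two 16 x 16 multiplies followed by one add. The sum is formed in
     * 64 bits so the model has no signed overflow; the single input that
     * exceeds 32 bits (all four halves equal to -32768) is not modeled
     * faithfully here. */
    int64_t sum = (int64_t)a_hi * b_hi + (int64_t)a_lo * b_lo;
    return (int32_t)sum;
}
```

For instance, packing (3, 4) and (5, 6) and passing both words to dotp2_model gives 3 * 5 + 4 * 6. On the 'C64x itself the same result comes from a single _dotp2 call.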
This operation maps exactly onto the operation the dot product kernel performs.
The modified version of the kernel absorbs two of the four ADDs into _dotp
intrinsics. The result is shown as Example 8–11. Notice that the variable c has
been eliminated by summing the results of the _dotp intrinsic directly.
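The effect on the add count can be illustrated with a host-side sketch. This is not the manual's Example 8–11: the function names below are hypothetical, dotp2_pair is a plain C stand-in for one _dotp2 operation, and the sketch assumes the element count n is a multiple of 4 and that the sums do not overflow 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one _dotp2: multiply two pairs of 16-bit
 * values and add the two products. */
static int32_t dotp2_pair(const int16_t *a, const int16_t *b)
{
    return (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
}

/* Multiply-only style: four products per iteration, each accumulated
 * separately, so every iteration needs four adds. */
static int32_t dot_four_adds(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        sum0 += a[i]     * b[i];
        sum1 += a[i + 1] * b[i + 1];
        sum2 += a[i + 2] * b[i + 2];
        sum3 += a[i + 3] * b[i + 3];
    }
    return sum0 + sum1 + sum2 + sum3;
}

/* Dot-product style: each dotp2_pair already contains one add, so only
 * two accumulating adds remain per iteration, and no temporary is needed
 * to hold the products. */
static int32_t dot_two_adds(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum0 = 0, sum1 = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        sum0 += dotp2_pair(&a[i],     &b[i]);
        sum1 += dotp2_pair(&a[i + 2], &b[i + 2]);
    }
    return sum0 + sum1;
}
```

Both functions return the same dot product; the second simply moves half of the additions inside the multiply step, which is the restructuring the text describes.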