Writing Parallel Code
Rearranging the order of the instructions also improves the performance of the
code. The SUB instruction can take the place of one of the NOP delay slots
for the LDH instructions. Moving the B instruction after the SUB removes the
need for the NOP 5 used at the end of the code in Example 6–9.

The branch now occurs immediately after the ADD instruction so that the MPY
and ADD execute in parallel with the five delay slots required by the branch
instruction.

6.3.5.2 Floating-Point Dot Product

Similarly, Example 6–11 shows the nonparallel assembly code for the
floating-point dot product loop. The MVK instruction initializes the loop
counter to 100. The ZERO instruction clears the accumulator. The NOP
instructions allow for the delay slots of the LDW, ADDSP, MPYSP, and B
instructions.

Executing this dot product code serially requires 21 cycles for each iteration
plus two cycles to set up the loop counter and initialize the accumulator;
100 iterations require 2102 cycles.

Example 6–11. Nonparallel Assembly Code for Floating-Point Dot Product

        MVK   .S1   100, A1       ; set up loop counter
        ZERO  .L1   A7            ; zero out accumulator
LOOP:
        LDW   .D1   *A4++,A2      ; load ai from memory
        LDW   .D1   *A3++,A5      ; load bi from memory
        NOP         4             ; delay slots for LDW
        MPYSP .M1   A2,A5,A6      ; ai * bi
        NOP         3             ; delay slots for MPYSP
        ADDSP .L1   A6,A7,A7      ; sum += (ai * bi)
        NOP         3             ; delay slots for ADDSP
        SUB   .S1   A1,1,A1       ; decrement loop counter
  [A1]  B     .S2   LOOP          ; branch to loop
        NOP         5             ; delay slots for branch
                                  ; Branch occurs here
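
The 21-cycle and 2102-cycle figures quoted for Example 6–11 can be checked by adding one issue cycle per instruction to the NOP cycles that follow it. The C sketch below is only a bookkeeping aid, not code for the target: the names loop_body, cycles_per_iteration, and total_cycles are hypothetical, and the model assumes strictly serial execution with no stalls.

```c
#include <stddef.h>

/* One entry per instruction in the serial loop body of Example 6-11:
   each costs one issue cycle plus the NOP cycles listed after it. */
struct step {
    const char *mnemonic;
    int nop_cycles;
};

static const struct step loop_body[] = {
    { "LDW",   0 },  /* load ai; the second LDW issues on the next cycle */
    { "LDW",   4 },  /* load bi, then NOP 4 */
    { "MPYSP", 3 },  /* NOP 3 */
    { "ADDSP", 3 },  /* NOP 3 */
    { "SUB",   0 },  /* decrement loop counter */
    { "B",     5 },  /* NOP 5 */
};

#define SETUP_CYCLES 2  /* MVK and ZERO, executed once before the loop */

/* Cycles for one pass through the loop body when nothing overlaps. */
int cycles_per_iteration(void)
{
    int total = 0;
    size_t n = sizeof loop_body / sizeof loop_body[0];
    for (size_t i = 0; i < n; i++)
        total += 1 + loop_body[i].nop_cycles;
    return total;
}

/* Setup cycles plus the serial cost of every iteration. */
int total_cycles(int iterations)
{
    return SETUP_CYCLES + iterations * cycles_per_iteration();
}
```

Six issue cycles and fifteen NOP cycles give 21 per iteration, so total_cycles(100) evaluates to 2 + 100 × 21 = 2102, matching the text.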
Assigning the same functional unit to both LDW instructions slows performance
of this loop. Therefore, reassign the functional units to execute the code in
parallel, as shown in the dependency graph in Figure 6–4. The parallel
assembly code is shown in Example 6–12.
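
Two instructions can share a cycle only if they use different functional units, which is why placing both loads on .D1 costs an extra cycle. The C sketch below illustrates that constraint under simplifying assumptions: min_issue_cycles is a hypothetical helper, the instructions are treated as mutually independent, and moving the second load to .D2 is an assumption made here for illustration.

```c
#include <string.h>

/* Hypothetical helper: the fewest cycles in which n mutually independent
   instructions can issue, given the functional unit assigned to each.
   A unit can start at most one instruction per cycle, so the most heavily
   used unit sets the lower bound. */
int min_issue_cycles(const char *const units[], int n)
{
    int worst = 0;
    for (int i = 0; i < n; i++) {
        int count = 0;
        /* Count how many instructions share the unit of instruction i. */
        for (int j = 0; j < n; j++)
            if (strcmp(units[i], units[j]) == 0)
                count++;
        if (count > worst)
            worst = count;
    }
    return worst;
}
```

With both loads on {".D1", ".D1"} the helper returns 2, while {".D1", ".D2"} returns 1. This models only the unit conflict; register-file and data-path constraints are ignored.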