Example 6–71. Final Assembly Code for FIR Filter With Redundant Load Elimination and
No Memory Hits With Outer Loop Software-Pipelined (Continued)
        ADD     .D2     B7,B9,B9        ;e sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ;e sum0 += x2 * h2
||      LDH     .D1     *A4++,B8        ;p x0 = x[j]
||      ADD     .L2X    A4,4,B1         ;o set up pointer to x[j+2]
||      ADD     .S1X    B4,2,A8         ;o set up pointer to h[1]
||      MVK     .S2     8,B2            ;o set up inner loop counter

        ADD     .L2X    A7,B9,B9        ;e sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ;e sum0 += x3 * h3
||      LDH     .D2     *B1++[2],B0     ;p x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;p x1 = x[j+i+1]
||[A2]  SUB     .S1     A2,1,A2         ;o decrement outer loop counter

        ADD     .L2     B7,B9,B9        ;e sum1 += x0 * h3
||      SHR     .S1     A9,15,A9        ;e sum0 >> 15
||      LDH     .D1     *A8++[2],B6     ;p h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;p h0 = h[i]

        SHR     .S2     B9,15,B9        ;e sum1 >> 15
||      LDH     .D1     *A4++[2],A5     ;p x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;p x0 = x[j+i+4]

        STH     .D1     A9,*A6++[2]     ;e y[j] = sum0 >> 15
||      STH     .D2     B9,*B11++[2]    ;e y[j+1] = sum1 >> 15
||      ZERO    .S1     A9              ;o zero out sum0
||      ZERO    .S2     B9              ;o zero out sum1
                                        ; outer loop branch occurs here

6.13.4 Comparing Performance

The improved cycle count for this loop is 2006 cycles: 50 ((7 × 4) + 6 + 6) + 6. The
outer-loop overhead for this loop has been reduced from 16 to 8 (6 + 6 – 4);
the – 4 accounts for the inner loop executing one fewer iteration (seven instead
of eight).

Table 6–26. Comparison of FIR Filter Code

Code Example                                                      Cycles                     Cycle Count
Example 6–64  FIR with redundant load elimination                 50 (16 × 2 + 9 + 6) + 2    2352
Example 6–69  FIR with redundant load elimination and no memory   50 (8 × 4 + 10 + 6) + 2    2402
              hits
Example 6–71  FIR with redundant load elimination and no memory   50 (7 × 4 + 6 + 6) + 6     2006
              hits with outer loop software-pipelined
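The arithmetic behind the cycle counts in Table 6–26 can be checked with a few lines of C. The following is only a sketch: the function name `fir_cycle_count` and its parameter names are hypothetical and do not come from this manual. It assumes that every entry in the Cycles column has the shape outer × (inner × cycles-per-iteration + a + b) + c, where a and b are the two per-outer-iteration overhead terms and c is the constant added once.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the manual): evaluates
 *   outer_iters * (inner_iters * cycles_per_iter + overhead_a + overhead_b) + outside
 * which is the assumed shape of each entry in the Cycles column of Table 6-26.
 */
long fir_cycle_count(long outer_iters, long inner_iters, long cycles_per_iter,
                     long overhead_a, long overhead_b, long outside)
{
    /* Reject negative inputs, which have no meaning as cycle counts. */
    assert(outer_iters >= 0 && inner_iters >= 0 && cycles_per_iter >= 0);
    assert(overhead_a >= 0 && overhead_b >= 0 && outside >= 0);

    /* Cycles spent in one pass of the outer loop. */
    long per_outer_iter = inner_iters * cycles_per_iter + overhead_a + overhead_b;

    /* Total: all outer-loop passes plus the cycles counted once. */
    return outer_iters * per_outer_iter + outside;
}
```

Under these assumptions, `fir_cycle_count(50, 16, 2, 9, 6, 2)` evaluates to 2352, `fir_cycle_count(50, 8, 4, 10, 6, 2)` to 2402, and `fir_cycle_count(50, 7, 4, 6, 6, 6)` to 2006, matching the three rows of the table.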
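As a reading aid for the comments in Example 6–71, the computation can be modelled in plain C. This is a minimal sketch inferred from those comments (two running sums, `sum0` and `sum1`, each shifted right by 15 and stored to `y[j]` and `y[j+1]`), not the manual's own source. The function name `fir_two_outputs`, the parameters `nout` and `ntaps`, and the requirement that `x` hold at least `nout + ntaps - 1` samples are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Hypothetical reference model: produces two outputs per outer-loop pass,
 * mirroring the sum0/sum1 pair in the listing's comments.
 * x must hold at least nout + ntaps - 1 samples; nout must be even.
 */
void fir_two_outputs(const short *x, const short *h, short *y,
                     int nout, int ntaps)
{
    assert(nout >= 0 && ntaps >= 0 && nout % 2 == 0);

    for (int j = 0; j < nout; j += 2) {
        int sum0 = 0;   /* accumulates y[j]   */
        int sum1 = 0;   /* accumulates y[j+1] */

        for (int i = 0; i < ntaps; i++) {
            sum0 += x[j + i]     * h[i];
            sum1 += x[j + i + 1] * h[i];
        }

        /* Right-shifting a negative int is implementation-defined in C;
           most compilers perform an arithmetic shift, as SHR does. */
        y[j]     = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}
```

The model deliberately ignores scheduling: it shares no loads between `sum0` and `sum1` and has no prolog or epilog, so it is useful only for checking results, not for estimating cycle counts.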