Avoiding Cross Path Stalls: Vector Sum Loop Kernel - Texas Instruments TMS320C6000 Programmer's Manual

The code above is sent to the assembly optimizer with the following compiler
options: –o3, –mi, –mt, –k, and –mg. Since a specific C6000 platform was not
specified, the default is to generate code for the 'C62x. The –o3 option enables
the highest level of the optimizer. The –mi option creates code with an interrupt
threshold equal to infinity. In other words, interrupts will never occur when this
code runs. The –k option keeps the assembly language file, and –mt indicates
that the programmer is assuming no aliasing (aliasing allows multiple pointers
to point to the same object). The –mg option allows profiling to occur in the
debugger for benchmarking purposes.
Example 8–26 below is the assembly output generated by the assembly
optimizer for the weighted vector sum loop kernel:

Example 8–26. Avoiding Cross Path Stalls: Vector Sum Loop Kernel
LOOP:                                        ; PIPED LOOP KERNEL
           AND     .L2X   A3,B6,B8           ; AND bn & bn+1 with mask to isolate bn
||         SHR     .S1    A0,0xf,A0          ; shift prod0 right by 15 -> sprod0
||         MPY     .M1X   B2,A5,A0           ; multiply an by constant ; prod0
|| [ A1]   B       .S2    LOOP               ; branch to loop if loop count >0
|| [ A1]   ADD     .L1    0xffffffff,A1,A1   ; decrement loop count
||         LDW     .D1T1  *A7++,A3           ; load 32-bits (bn & bn+1)
||         LDW     .D2T2  *B5++,B2           ; load 32-bits (an & an+1)

   [ A2]   MPYSU   .M1    2,A2,A2            ;
|| [!A2]   STH     .D2T2  B1,*B4++(4)        ; store 16-bits (cn+1)
|| [!A2]   STH     .D1T1  A6,*A8++(4)        ; store 16-bits (cn)
||         ADD     .L1X   A4,B0,A6           ; add sprod1 + bn+1
||         ADD     .L2X   B8,A0,B1           ; add sprod0 + bn
||         SHR     .S2    B9,0xf,B0          ; shift prod1 right by 15 -> sprod1
||         SHR     .S1    A3,0x10,A4         ; shift bn & bn+1 by 16 to isolate bn+1
||         MPYHL   .M2    B2,B7,B9           ; multiply an+1 by a constant ; prod1

This two-cycle loop produces two 16-bit results per loop iteration, as planned.
If the code is used on the 'C64x, however, be aware that in the first execute
packet A0 (prod0) is shifted to the right by 15 and the result is written back into
A0. In the next execute packet, and therefore the next clock cycle, A0 (sprod0)
is used as a cross path operand to the .L2 functional unit. If this code were run
on the 'C64x, it would exhibit a one-cycle stall as described above: A0 is
updated in cycle 2 and used as a cross path operand in cycle 3. As a result,
the two-cycle loop would take three cycles to execute.
The cross path stall can, in most cases, be avoided if the –mv6400 option is
added to the compiler options list. This option indicates to the
compiler/assembly optimizer that the code will be run on the 'C64x core.
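The stall reasoning for Example 8–26 can be checked with a small model. The C sketch below is only an illustration, not part of the TI tools: the names Insn, KERNEL, count_cross_path_stalls and cycles_per_iteration are hypothetical, and the table lists only the kernel instructions that write or cross-read a shared register. It assumes the rule referred to in the text, namely that reading a register over a cross path in the cycle right after it was updated costs one stall cycle. It treats the loop in steady state and does not attempt to capture any exceptions the hardware may make.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One instruction of a software-pipelined kernel (illustrative model only). */
typedef struct {
    const char *name;   /* mnemonic and functional unit, for readability    */
    int packet;         /* execute packet within the kernel, 0-based        */
    int delay_slots;    /* 0 = single cycle, 1 = 16x16 multiply, 4 = load   */
    const char *dst;    /* register written, or NULL                        */
    const char *xsrc;   /* register read over a cross path, or NULL         */
} Insn;

/* Instructions from Example 8-26 that write or cross-read a shared register. */
static const Insn KERNEL[] = {
    { "AND .L2X",  0, 0, "B8", "A3" },
    { "SHR .S1",   0, 0, "A0", NULL },  /* prod0 >> 15 -> sprod0            */
    { "MPY .M1X",  0, 1, "A0", "B2" },
    { "LDW .D1T1", 0, 4, "A3", NULL },
    { "LDW .D2T2", 0, 4, "B2", NULL },
    { "ADD .L1X",  1, 0, "A6", "B0" },
    { "ADD .L2X",  1, 0, "B1", "A0" },  /* reads sprod0 over the cross path */
    { "SHR .S2",   1, 0, "B0", NULL },
};
static const size_t KERNEL_LEN = sizeof KERNEL / sizeof KERNEL[0];

/* Count execute packets that stall in steady state. A packet stalls (once)
 * if any of its cross-path operands was written in the preceding cycle.    */
int count_cross_path_stalls(const Insn *k, size_t n, int kernel_cycles)
{
    int stalls = 0;
    assert(kernel_cycles > 0);
    for (int p = 0; p < kernel_cycles; p++) {
        int stalled = 0;
        for (size_t r = 0; r < n && !stalled; r++) {
            if (k[r].packet != p || k[r].xsrc == NULL)
                continue;
            for (size_t w = 0; w < n; w++) {
                if (k[w].dst == NULL || strcmp(k[w].dst, k[r].xsrc) != 0)
                    continue;
                /* Cycle (modulo the kernel length) in which the result lands. */
                int write_cycle = (k[w].packet + k[w].delay_slots) % kernel_cycles;
                if ((write_cycle + 1) % kernel_cycles == p) {
                    stalled = 1;
                    break;
                }
            }
        }
        stalls += stalled;
    }
    return stalls;
}

/* Effective cycles per loop iteration under this simplified model. */
int cycles_per_iteration(const Insn *k, size_t n, int kernel_cycles)
{
    return kernel_cycles + count_cross_path_stalls(k, n, kernel_cycles);
}
```

Under these assumptions, cycles_per_iteration(KERNEL, KERNEL_LEN, 2) returns 3. The only stalling pair is SHR .S1 writing A0 in the first execute packet and ADD .L2X reading A0 over the cross path in the second, which is the pair the text identifies.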
'C64x Programming Considerations
Linear Assembly Considerations
