Lesson 3: Packed Data Optimization Of Memory Bandwidth - Texas Instruments TMS320C6000 Programmer's Manual

Lesson 3: Packed Data Optimization of Memory Bandwidth

The six memory accesses appear as .D and .T units. The four multiplies appear
as .M units. The two shifts and the branch show up as .S units. The decrement
and the two adds appear as .LS and .LSD units. Due to partitioning, they don't
all show up as .LSD operations: the two adds must each read one value from the
opposite side, and because that operation cannot be performed on the .D unit,
the adds are listed as .LS operations.
Analyzing this part of the feedback shows that resources are most limited by
the memory accesses, which is why an asterisk highlights the .D units and .T
address paths.
Q Does this mean that we cannot make the loop operate any faster?
A Further insight into the 'C6000 architecture is necessary here.
The 'C62x fixed-point device loads and/or stores 32 bits every cycle. In
addition, the 'C67x floating-point and 'C64x fixed-point devices load two
64-bit values each cycle. In our example, we load four 16-bit values and store
two 16-bit values every three cycles. That is (4 + 2) x 16 = 96 bits over three
cycles, so we only use 32 bits of memory access every cycle. Because memory
access is the resource bottleneck in our loop, moving more data with each
memory access improves the performance of the loop.
In the unrolled loop generated from lesson2_c, we load two consecutive 16-bit
elements with LDHs from each of the xptr and yptr arrays.
Q Why not use a single LDW to load one 32-bit element, with the resulting
register load containing the first element in one half of the 32-bit register
and the second element in the other half?
A This is called packed data optimization. A single 32-bit load instruction
effectively performs two 16-bit loads.
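The idea can be sketched in portable C as follows. This is only an illustration under stated assumptions, not the code the compiler generates: the helper names load_pair, lo16, and hi16 are hypothetical, memcpy stands in for the single LDW, and which half of the word holds the first element depends on the byte order of the target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch two adjacent 16-bit elements with one
   32-bit access. memcpy keeps the sketch portable; on a target with
   word-aligned data this corresponds to a single 32-bit load. */
static uint32_t load_pair(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Lower half of the packed word. On a little-endian target this is the
   first element; on a big-endian target it is the second. The narrowing
   conversion assumes the usual two's-complement behavior. */
static int16_t lo16(uint32_t w)
{
    return (int16_t)(w & 0xFFFFu);
}

/* Upper half of the packed word (the other element of the pair). */
static int16_t hi16(uint32_t w)
{
    return (int16_t)(w >> 16);
}
```

One call to load_pair replaces two separate 16-bit loads, which halves the number of memory accesses needed to read the same data.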
Q Why doesn't the compiler do this automatically in lesson2_c?
A Again, the answer lies in the amount of information the compiler has access
to from the local scope of lesson2_c.
To perform an LDW (32-bit load) on the 'C62x and 'C67x cores, the address must
be aligned to a word address; otherwise, incorrect data is loaded. An address
is word-aligned if the lower two bits of the address are zero. Unfortunately,
in our example, the pointers xptr and yptr are passed into lesson2_c, and
there is no local scope knowledge as to their values. Therefore, the compiler
is forced to be conservative and assume that these pointers might not be
aligned. Once again, we can pass more information to the compiler, this time
via the _nassert statement.
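A minimal sketch of the alignment test and of how such a hint might be placed is shown here. It is not the contents of lesson3_c.c: is_word_aligned, ASSUME_WORD_ALIGNED, and weighted_sum are hypothetical names, and the loop body is only an assumed example. On a TI compiler the macro forwards to _nassert; elsewhere it falls back to a run-time assert so that the sketch can be tried on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* An address is word-aligned when its lower two bits are zero. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Hypothetical wrapper. With a TI compiler it passes the alignment fact
   to the optimizer through _nassert; with other compilers it checks the
   same condition at run time instead. */
#ifdef __TI_COMPILER_VERSION__
#define ASSUME_WORD_ALIGNED(p) _nassert(((int)(p) & 0x3) == 0)
#else
#define ASSUME_WORD_ALIGNED(p) assert(is_word_aligned(p))
#endif

/* Hypothetical kernel (not the tutorial source): a weighted sum of two
   16-bit arrays with Q15 weights w0 and w1. */
static void weighted_sum(const short *xptr, const short *yptr, short *zptr,
                         short w0, short w1, int n)
{
    int i;

    /* Tell the compiler that both input pointers are word-aligned. */
    ASSUME_WORD_ALIGNED(xptr);
    ASSUME_WORD_ALIGNED(yptr);

    for (i = 0; i < n; i++)
        zptr[i] = (short)((w0 * xptr[i] + w1 * yptr[i]) >> 15);
}
```

The hint is only safe if every caller really does pass word-aligned arrays; if the assertion is false, the optimized code may load incorrect data.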
Open lesson3_c.c