Texas Instruments TMS320C6000 Programmer's Manual

TMS320C6000
Programmer's Guide
Literature Number: SPRU198E
October 2000
Printed on Recycled Paper


Summary of Contents for Texas Instruments TMS320C6000

  • Page 1 TMS320C6000 Programmer’s Guide Literature Number: SPRU198E October 2000 Printed on Recycled Paper...
  • Page 2 IMPORTANT NOTICE Texas Instruments (TI) reserves the right to make changes to its products or to discontinue any semiconductor product or service without notice, and advises its customers to obtain the latest version of relevant information to verify, before placing orders, that the information being relied on is current.
  • Page 3 Preface Read This First About This Manual This manual is a reference for programming TMS320C6000 digital signal processor (DSP) devices. Before you use this book, you should install your code generation and debugging tools. This book is organized in five major parts: Part I: Introduction includes a brief description of the ’C6000 architecture...
  • Page 4 Related Documentation From Texas Instruments Related Documentation From Texas Instruments The following books describe the TMS320C6000 devices and related support tools. To obtain a copy of any of these TI documents, call the Texas Instru- ments Literature Response Center at (800) 477–8924. When ordering, please identify the book by its title and literature number.
  • Page 5 Trademarks Trademarks Solaris and SunOS are trademarks of Sun Microsystems, Inc. VelociTI is a trademark of Texas Instruments Incorporated. Windows and Windows NT are registered trademarks of Microsoft Corporation. Read This First...
  • Page 6: Table Of Contents

    ........... Uses example code to walk you through the code development flow for the TMS320C6000.
  • Page 7 Contents Linking Issues ..............Explains linker messages and how to use RTS functions.
  • Page 8 Contents Using Word Access for Short Data and Doubleword Access for Floating-Point Data ............6-19 6.4.1 Unrolled Dot Product C Code...
  • Page 9 Contents Loop Unrolling ............6-94 6.9.1 Unrolled If-Then-Else C Code...
  • Page 10 Contents Interrupts ............... Describes interrupts from a software programming point of view.
  • Page 11 Figures Figures 3–1 Dependency Graph for Vector Sum #1 ......... . . 3–2 Software-Pipelined Loop .
  • Page 12 Figures 6–23 4-Bank Interleaved Memory With Two Memory Blocks ......6-119 6–24 Dependency Graph of FIR Filter (With Even and Odd Elements of Each Array on Same Loop Cycle) .
  • Page 13 3–6 TMS320C6000 C/C++ Compiler Intrinsics ........
  • Page 14 Tables 6–18 Resource Table for If-Then-Else Code ......... . 6-89 6–19 Comparison of If-Then-Else Code Examples...
  • Page 15 Examples Examples 1–1 Compiler and/or Assembly Optimizer Feedback ........1–2 Stage 1 Feedback .
  • Page 16 Examples 3–15 Float Dot Product Using Intrinsics ..........3-31 3–16 Float Dot Product With Peak Performance...
  • Page 17 Examples 6–21 Linear Assembly for Fixed-Point Dot Product Inner Loop (With Conditional SUB Instruction) ..........6-30 6–22 Linear Assembly for Floating-Point Dot Product Inner Loop...
  • Page 18 Examples 6–58 Linear Assembly for Full Live-Too-Long Code ........6-107 6–59 Assembly Code for Live-Too-Long With Move Instructions...
  • Page 19 Examples 8–11 Final Assembly Code for Dot–Product Kernel’s Inner Loop ......8-31 8–12 Vectorized form of the Vector Complex Multiply Kernel .
  • Page 20: Introduction

    Topic Page TMS320C6000 Architecture ........
  • Page 21: Tms320C6000 Architecture

    TMS320C6000 Architecture / TMS320C6000 Pipeline 1.1 TMS320C6000 Architecture The ’C62x is a fixed-point digital signal processor (DSP) and is the first DSP to use the VelociTI architecture. VelociTI is a high-performance, advanced very-long-instruction-word (VLIW) architecture, making it an excellent choice for multichannel, multifunctional, and performance-driven applications.
  • Page 22: Code Development Flow To Increase Performance

    Code Development Flow To Increase Performance 1.3 Code Development Flow To Increase Performance Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or Unix workstation and then painstakingly porting that C code to hand-coded DSP assembly language. This is both time consuming and error prone.
  • Page 23 Code Development Flow To Increase Performance You can achieve the best performance from your ’C6000 code if you follow this code development flow when you are writing and debugging your code: Phase 1: Write C code Develop C Code Compile Profile Efficient? Complete...
  • Page 24 Code Development Flow To Increase Performance The following table lists the phases in the 3-step software development flow shown on the previous page, and the goal for each phase: Phase Goal You can develop your C code for phase 1 without any knowledge of the ’C6000.
  • Page 25: Code Development Steps

    Code Development Flow To Increase Performance Table 1–1, Code Development Steps, describes the recommended code development flow for developing code that achieves the highest performance on loops. Table 1–1. Code Development Steps Step Description Compile and profile native C/C++ code Validates original C/C++ code Phase Determines which loops are most important in terms of MIPS requirements...
  • Page 26: Optimizing Assembly Code Via Linear Assembly

    C/C++ can offer, works within the framework of C/C++, and is much like programming in higher level C. For more information on the assembly optimizer, see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide and Chapter 6, Optimizing Assembly Code via Linear Assembly, in this book.
  • Page 27: Compiler And/Or Assembly Optimizer Feedback

    Code Development Flow To Increase Performance Example 1–1. Compiler and/or Assembly Optimizer Feedback ;*––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Known Minimum Trip Count Known Maximum Trip Count Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 4 Unpartitioned Resource Bound Partitioned Resource Bound(*) Resource Partition: A–side B–side...
  • Page 28: Understanding Feedback

    Understanding Feedback 1.4 Understanding Feedback The compiler provides some feedback by default. Additional feedback is generated with the -mw option. The feedback is located in the .asm file that the compiler generates. To view the feedback, you must also enable -k, which retains the .asm output from the compiler.
  • Page 29 Understanding Feedback Maximum Trip Count Factor. The maximum number that will divide evenly into the trip count. Even though the exact value of the trip count is not deterministic, it may be known that the value is a multiple of 2, 4, and so on, which allows more aggressive packed-data and unrolling optimization.
  • Page 30: Stage 2: Collect Loop Resource And Dependency Graph Information

    Understanding Feedback 1.4.2 Stage 2: Collect Loop Resource and Dependency Graph Information The second stage of software pipelining a loop is collecting loop resource and dependency graph information. The results of stage 2 will be displayed in the feedback window as follows: Example 1–3.Stage 2 Feedback Loop Carried Dependency Bound(^) : 4 Unpartitioned Resource Bound...
  • Page 31 Understanding Feedback Unpartitioned resource bound across all resources. The best-case resource bound minimum iteration interval before the compiler has partitioned each instruction to the A or B side. In Example 1–3, the unpartitioned resource bound is 4 because the .S units are required for 8 cycles, and there are 2 .S units.
  • Page 32 Understanding Feedback Logical ops (.LS) represents the total number of instructions that can use either the .L or .S unit. Addition ops (.LSD) represents the total number of instructions that can use either the .L or .S or .D unit. Bound (.L .S .LS) represents the resource bound value as determined by the number of instructions that use the .L and .S units.
  • Page 33: Stage 3: Software Pipeline The Loop

    Understanding Feedback 1.4.3 Stage 3: Software Pipeline the Loop Once the compiler has completed qualification of the loop, partitioned it, and analyzed the necessary loop carry and resource requirements, it can begin to attempt software pipelining. This section will focus on the following lines from the feedback example: Example 1–4.Stage 3 Feedback Searching for software pipeline schedule at ...
  • Page 34 Understanding Feedback Sometimes the compiler finds a valid software pipeline schedule but one or more of the values is live too long. The lifetime of a register value runs from the cycle the value is written into the register through the last cycle the value is read by another instruction.
  • Page 35 Minimum required memory pad : 2 bytes The minimum required memory padding to use -mh is 2 bytes. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more informa- tion on the -mh option and the minimum required memory padding.
  • Page 36: Compiler Optimization Tutorial

    Chapter 2 Compiler Optimization Tutorial This chapter walks you through the code development flow and introduces you to compiler optimization techniques that were introduced in Chapter 1. It uses step-by-step instructions and code examples to show you how to use the software development tools in each phase of development.
  • Page 37: Introduction: Simple C Tuning

    Introduction: Simple C Tuning 2.1 Introduction: Simple C Tuning The ’C6000 compiler delivers the industry’s best “out of the box” C performance. In addition to performing many common DSP optimizations, the ’C6000 compiler also performs software pipelining on various MIPS-intensive loops.
  • Page 38 Introduction: Simple C Tuning 3) Click on the ”Add to system configuration” button. 4) Click on the close button and exit setup. 5) Save the configuration on exit. Load the Tutorial Workspace 1) Start Code Composer Studio. 2) From the Project menu, select Open. Browse to: ti\tutorial\sim62xx\optimizing_c\ 3) Select optimizing_c.pjt , and click Open.
  • Page 39 Introduction: Simple C Tuning You can see cycle counts of 414, 98, 79, and 55 for functions in tutor1–4, running on the C6xxx simulator. Each of these functions contains the same C code but has some minor differences related to the amount of information to which the compiler has access.
  • Page 40: Lesson 1: Loop Carry Path From Memory Pointers

    Lesson 1: Loop Carry Path From Memory Pointers 2.2 Lesson 1: Loop Carry Path From Memory Pointers Open lesson_c.c In the Project View window, right–click on lesson_c.c and select Open. Example 2–2. lesson_c.c void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N) { int i, w_vec1, w_vec2;...
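The listing above is cut off by the page scrape. The sketch below reconstructs the weighted vector sum lesson_c() describes; the loop body (two 16-bit multiplies, an add, and a Q15 scaling shift per iteration) matches the operation counts discussed in Lessons 2 and 3, but treat it as a reconstruction rather than the manual's verbatim source.

```c
/* Reconstructed sketch of lesson_c(): weighted sum of two short
 * vectors with Q15 weights taken from zptr[0] and zptr[1]. The
 * pointer parameters are deliberately unqualified, so the compiler
 * must assume they may alias -- the Lesson 1 problem. */
void lesson_c(short *xptr, short *yptr, short *zptr,
              short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1, w2;

    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++) {
        w_vec1 = xptr[i] * w1;              /* 16x16 -> 32 multiply */
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15; /* scale back to Q15    */
    }
}
```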
  • Page 41: Feedback From Lesson_C.asm

    Lesson 1: Loop Carry Path From Memory Pointers Example 2–3. Feedback From lesson_c.asm ;*––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Known Minimum Trip Count Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 10 Unpartitioned Resource Bound Partitioned Resource Bound(*) Resource Partition: A–side B–side .L units...
  • Page 42: Lesson_C.asm

    Lesson 1: Loop Carry Path From Memory Pointers A schedule with ii = 10 implies that each iteration of the loop takes ten cycles. Obviously, with eight resources available every cycle on such a small loop, we would expect this loop to do better than this. Q: Where are the problems with this loop? A: A closer look at the feedback in lesson_c.asm gives us the answer.
  • Page 43: Lesson1_C.c

    No other pointer in lesson1_c.c points to xptr and no other pointer in lesson1_c.c points to yptr. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on the restrict type qualifier.
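The Lesson 1 change can be shown in a few lines. This is a sketch (lesson1_c.c itself is truncated above, and the const qualifiers are added here for clarity): adding the C99 restrict keyword promises the compiler that, within this function, the three buffers never overlap, which removes the loop-carried memory dependency.

```c
/* Sketch of the Lesson 1 fix: restrict promises no aliasing among
 * the buffers, so loads for iteration i+1 may be scheduled before
 * the store for iteration i. */
void lesson1_c(const short *restrict xptr, const short *restrict yptr,
               const short *zptr, short *restrict w_sum, int N)
{
    short w1 = zptr[0], w2 = zptr[1];   /* Q15 weights */
    for (int i = 0; i < N; i++)
        w_sum[i] = (xptr[i] * w1 + yptr[i] * w2) >> 15;
}
```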
  • Page 44: Lesson1_C.asm

    Lesson 1: Loop Carry Path From Memory Pointers Example 2–6. lesson1_c.asm ;*––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Known Minimum Trip Count Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 0 Unpartitioned Resource Bound Partitioned Resource Bound(*) Resource Partition: A–side B–side .L units .S units .D units...
  • Page 45 Lesson 1: Loop Carry Path From Memory Pointers At this point, the Loop Carried Dependency Bound is zero. By simply passing more information to the compiler, we allowed it to improve a 10–cycle loop to a 2–cycle loop. Lesson 4 in this tutorial shows how the compiler retrieves this type of information automatically by gaining full view of the entire program with program-level optimization switches.
  • Page 46: Status Update: Tutorial Example Lesson_C Lesson1_C

    Lesson 1: Loop Carry Path From Memory Pointers Table 2–1. Status Update: Tutorial example lesson_c lesson1_c Tutorial Example Lesson_c Lesson1_c Potential pointer aliasing info (discussed in Lesson 1) Loop count info – minimum trip count (discussed in Lesson 2) Loop count info – max trip count factor (discussed in Lesson 2) Alignment info –...
  • Page 47: Lesson 2: Balancing Resources With Dual-Data Paths

    Lesson 2: Balancing Resources With Dual-Data Paths 2.3 Lesson 2: Balancing Resources With Dual-Data Paths Lesson 1 showed you a simple way to make large performance gains in lesson_c. The result is lesson1_c with a 2–cycle loop. Q: Is this the best the compiler can do? Is this the best that is possible on the VelociTI architecture? A: Again, the answers lie in the amount of knowledge to which the compiler has access.
  • Page 48: Lesson1_C.asm

    Lesson 2: Balancing Resources With Dual-Data Paths Example 2–7. lesson1_c.asm ;*––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Known Minimum Trip Count Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 0 Unpartitioned Resource Bound Partitioned Resource Bound(*) Resource Partition: A–side B–side .L units .S units .D units .M units...
  • Page 49 Lesson 2: Balancing Resources With Dual-Data Paths The first iteration interval (ii) attempted was two cycles because the Partitioned Resource Bound is two. We can see the reason for this if we look below at the .D units and the .T address paths. This loop requires two loads (from xptr and yptr) and one store (to w_sum) for each iteration of the loop.
  • Page 50: Lesson2_C.c

    The second argument is the maximum number of times the loop will iterate. The trip count must be evenly divisible by the third argument. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information about the MUST_ITERATE pragma. For this example, we chose a trip count large enough to tell the compiler that it is more efficient to unroll.
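Put together, the Lesson 2 change looks like the sketch below. MUST_ITERATE(min, max, factor) tells the TI compiler the loop runs at least 10 times and always a multiple of 2 times, enabling safe 2x unrolling. The pragma is a TI extension; host compilers simply ignore it, so this sketch still builds (unoptimized) elsewhere.

```c
/* Sketch of lesson2_c: same loop as Lesson 1, plus trip-count
 * information. On a host compiler the unknown pragma is ignored. */
void lesson2_c(const short *restrict xptr, const short *restrict yptr,
               const short *zptr, short *restrict w_sum, int N)
{
    short w1 = zptr[0], w2 = zptr[1];
#pragma MUST_ITERATE(10, , 2)        /* min 10 iterations, factor 2 */
    for (int i = 0; i < N; i++)
        w_sum[i] = (xptr[i] * w1 + yptr[i] * w2) >> 15;
}
```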
  • Page 51: Lesson2_C.asm

    Lesson 2: Balancing Resources With Dual-Data Paths Example 2–9. lesson2_c.asm ;*––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Loop Unroll Multiple : 2x Known Minimum Trip Count : 10 Known Maximum Trip Count : 1073741823 Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 0 Unpartitioned Resource Bound Partitioned Resource Bound(*) Resource Partition:...
  • Page 52: Status Update: Tutorial Example Lesson_C Lesson1_C Lesson2_C

    Lesson 2: Balancing Resources With Dual-Data Paths Count, is displayed in the feedback. This represents the maximum signed integer value divided by two, or 3FFFFFFFh. Therefore, by passing information without modifying the loop code, compiler performance improves from a 10–cycle loop to 2 cycles and now to 1.5 cycles. Q: Is this the lower limit? A: Check out Lesson 3 to find out!
  • Page 53 Lesson 3: Packed Data Optimization of Memory Bandwidth 2.4 Lesson 3: Packed Data Optimization of Memory Bandwidth Lesson 2 produced a 3–cycle loop that performed two iterations of the original vector sum of two weighted vectors. This means that each iteration of our loop now performs six memory accesses, four multiplies, two adds, two shift operations, a decrement for the loop counter, and a branch.
  • Page 54: Lesson 3: Packed Data Optimization Of Memory Bandwidth

    Lesson 3: Packed Data Optimization of Memory Bandwidth The six memory accesses appear as .D and .T units. The four multiplies appear as .M units. The two shifts and the branch show up as .S units. The decrement and the two adds appear as .LS and .LSD units. Due to partitioning, they don’t all show up as .LSD operations.
  • Page 55: Lesson3_C.c

    Lesson 3: Packed Data Optimization of Memory Bandwidth Example 2–11. lesson3_c.c #define WORD_ALIGNED(x) (_nassert(((int)(x) & 0x3) == 0)) void lesson3_c(short * restrict xptr, short * restrict yptr, short *zptr, short *w_sum, int N) int i, w_vec1, w_vec2; short w1,w2; WORD_ALIGNED(xptr); WORD_ALIGNED(yptr);...
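On the target, _nassert() generates no code; it only hands the compiler a provable property, here 4-byte alignment of the pointers. The host-checkable sketch below exposes the same predicate so alignment claims can be tested at run time; it uses uintptr_t rather than the manual's (int) cast so it is also correct on 64-bit hosts.

```c
#include <stdint.h>

/* Host-checkable version of the WORD_ALIGNED() idea from
 * lesson3_c.c: true when the pointer is 4-byte (word) aligned.
 * On the 'C6000 the corresponding _nassert() emits no code. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}
```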
  • Page 56: Lesson3_C.asm

    Lesson 3: Packed Data Optimization of Memory Bandwidth Example 2–12. lesson3_c.asm ;*––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Loop Unroll Multiple : 2x Known Minimum Trip Count : 10 Known Maximum Trip Count : 1073741823 Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 0 Unpartitioned Resource Bound Partitioned Resource Bound(*) Resource Partition:...
  • Page 57: Status Update: Tutorial Example Lesson_C Lesson1_C Lesson2_C Lesson3_C

    Lesson 3: Packed Data Optimization of Memory Bandwidth Table 2–3. Status Update: Tutorial example lesson_c lesson1_c lesson2_c lesson3_c Tutorial Example Lesson_c Lesson1_c Lesson2_c Lesson3_c Potential pointer aliasing info (discussed in Les- son 1) Loop count info – minimum trip count (discussed in Lesson 2) Loop count info –...
  • Page 58: Lesson 4: Program Level Optimization

    Lesson 4: Program Level Optimization 2.5 Lesson 4: Program Level Optimization In Lesson 3, you learned how to pass information to the compiler. This increased the amount of information visible to the compiler from the local scope of each function. Q: Is this necessary in all cases? A: The answer is no, not in all cases.
  • Page 59: Status Update: Tutorial Example Lesson_C Lesson1_C Lesson2_C Lesson3_C

    Lesson 4: Program Level Optimization Example 2–13. Profile Statistics Location Count Average Total Maximum Minimum lesson_c.c line 27 5020.0 5020 5020 5020 lesson_c.c line 36 60.0 lesson1_c.c line 37 60.0 lesson2_c.c line 39 60.0 lesson3_c.c line 44 60.0 lesson1_c.c line 27 12.0 lesson2_c.c line 29 12.0...
  • Page 60: Lesson 5: Writing Linear Assembly

    Lesson 5: Writing Linear Assembly 2.6 Lesson 5: Writing Linear Assembly When the compiler does not fully exploit the potential of the ’C6000 architecture, you may be able to get better performance by writing your loop in linear assembly. Linear assembly is the input for the assembly optimizer. Linear assembly is similar to regular ’C6000 assembly code in that you use ’C6000 instructions to write your code.
  • Page 61: Using The Iircas4 Function In C

    Lesson 5: Writing Linear Assembly Let’s look at a new example, iircas4, which will show the benefit of using linear assembly. The compiler does not optimally partition this loop. Thus, the iircas4 function does not improve with the C modification techniques we saw in the first portion of the chapter.
  • Page 62: Software Pipelining Feedback From The Iircas4 C Code

    Lesson 5: Writing Linear Assembly Example 2–15. Software Pipelining Feedback From the iircas4 C Code ;*–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Known Minimum Trip Count : 10 Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 2 Unpartitioned Resource Bound Partitioned Resource Bound(*) Resource Partition: A–side B–side...
  • Page 63 Lesson 5: Writing Linear Assembly Resource Bound is higher, this usually means we can make a better partition by writing the code in linear assembly. Notice that there are 5 cross path reads on the A side and only 3 on the B side. We would like 4 cross path reads on the A side and 4 cross path reads on the B side.
  • Page 64: Rewriting The Iircas4 ( ) Function In Linear Assembly

    Lesson 5: Writing Linear Assembly Example 2–16. Rewriting the iircas4 ( ) Function in Linear Assembly .def _iircas4_sa _iircas4_sa: .cproc AI,C,BD,AY .no_mdep .reg BD0,BD1,AA,AB,AJ0,AF0,AE0,AG0,AH0,AY0,AK0,AM0,BD00 .reg BA2,BB2,BJ1,BF1,BE1,BG1,BH1,BY1,BK1,BM1 *+AY[0],AY0 *+AY[1],BY1 .mptr bank+0, 8 .mptr BD, bank+4, 8 LOOP: .trip .D2T1 *C++, AA ;...
  • Page 65: Software Pipeline Feedback From Linear Assembly

    Lesson 5: Writing Linear Assembly The following example shows the software pipeline feedback from Example 2–16. Example 2–17. Software Pipeline Feedback from Linear Assembly ;*––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––* SOFTWARE PIPELINE INFORMATION Loop label : LOOP Known Minimum Trip Count : 10 Known Max Trip Count Factor Loop Carried Dependency Bound(^) : 3 Unpartitioned Resource Bound Partitioned Resource Bound(*)
  • Page 66: Optimizing C/C++ Code

    Chapter 3 Optimizing C/C++ Code You can maximize C/C++ performance by using compiler options, intrinsics, and code transformations. This chapter discusses the following topics: The compiler and its options Intrinsics Software pipelining Loop unrolling Topic Page Writing C/C++ Code ......... . . Compiling C/C++ Code .
  • Page 67: Writing C/C++ Code

    Writing C/C++ Code 3.1 Writing C/C++ Code This chapter shows you how to analyze and tailor your code to be sure you are getting the best performance from the ’C6000 architecture. 3.1.1 Tips on Data Types Give careful consideration to the data type size when writing your code. The ’C6000 compiler defines a size for each data type (signed and unsigned): char 8 bits...
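Because target sizes differ from host sizes (notably, long is 40 bits on the ’C6000 but typically 32 or 64 bits on a host), exact-width typedefs make ported code unambiguous. A small sketch of this tip (the type names are illustrative, not from the manual):

```c
#include <stdint.h>

/* Exact-width types avoid accidental dependence on the host's (or
 * the 'C6000's) native sizes when code is ported between them. */
typedef int16_t sample_t;   /* one 16-bit filter sample      */
typedef int32_t prod_t;     /* full 16x16 -> 32-bit product  */

/* Q15 multiply-and-scale, the operation used throughout the
 * tutorial's weighted vector sum. */
static prod_t scale_q15(sample_t s, sample_t w)
{
    return ((prod_t)s * w) >> 15;
}
```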
  • Page 68: Analyzing C Code Performance

    –mg option and executing load6x with the –g option. The profile results will be stored in a file with the .vaa extension. Refer to the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information. Enable the clock and use profile points and the RUN command in the Code Composer debugger to track the number of CPU clock cycles consumed by a particular section of code.
  • Page 69: Compiling C/C++ Code

    [ options ] [ filenames ] [–z [ linker options ] [ object files ]] For a complete description of the C/C++ compiler and the options discussed in this chapter, see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide (SPRU187) .
  • Page 70: Compiler Options For Performance

    Compiling C/C++ Code The options in Table 3–2 can improve performance but require certain characteristics to be true, and are described below. Table 3–2. Compiler Options for Performance Option Description † –o3 Represents the highest level of optimization available. Various loop optimizations are performed, such as software pipelining, unrolling, and SIMD.
  • Page 71: Compiler Options For Control Code

    Compiling C/C++ Code The options described in Table 3–4 are recommended for control code, and will result in smaller code size with minimal performance degradation. Table 3–4. Compiler Options for Control Code Option Description † –o3 In addition to the optimizations described in Table 3–2, -o3 can perform other code size reducing optimizations like: eliminating unused assignments, eliminating local and global common subexpressions, and removing functions that are never called.
  • Page 72: Memory Dependencies

    .no_mdep directive to the linear assembly source file. Specific memory dependencies should be specified with the .mdep directive. For more information see section 4.4, Assembly Optimizer Directives in the TMS320C6000 Optimizing C/C++ Compiler User’s Guide . Optimizing C/C++ Code...
  • Page 73: Dependency Graph For Vector Sum #1

    Compiling C/C++ Code To illustrate the concept of memory dependencies, it is helpful to look at the algorithm code in a dependency graph. Example 3–1 shows the C code for a basic vector sum. Figure 3–1 shows the dependency graph for this basic vector sum.
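Example 3–1 itself is truncated in this listing; a plausible reconstruction of the basic vector sum it refers to is below. With unqualified pointers, the compiler must assume sum may alias in1 or in2, which is exactly the dependency Figure 3–1 illustrates.

```c
/* Reconstructed sketch of the basic vector sum of Example 3-1.
 * Unqualified pointers force the compiler to serialize each store
 * to sum[] against the following loads from in1[]/in2[]. */
void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    unsigned int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
```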
  • Page 74: Use Of The Restrict Type Qualifier With Pointers

    Compiling C/C++ Code The dependency graph in Figure 3–1 shows that: The paths from sum[i] back to in1[i] and in2[i] indicate that writing to sum may have an effect on the memory pointed to by either in1 or in2. A read from in1 or in2 cannot begin until the write to sum finishes, which creates an aliasing problem.
  • Page 75: Use Of The Restrict Type Qualifier With Arrays

    Compiling C/C++ Code Example 3–3.Use of the Restrict Type Qualifier With Arrays void func1(int c[restrict], int d[restrict]) int i; for(i = 0; i < 64; i++) c[i] += d[i]; d[i] += 1; Do not use the restrict keyword with code such as listed in Example 3–4. By using the restrict keyword in Example 3–4, you are telling the compiler that it is legal to write to any location pointed to by a before reading the location pointed to by b .
  • Page 76 If your code does not follow the assumptions generated by the –mt option, you can get incorrect results. For more information on the –mt option refer to the TMS320C6000 Optimizing Compiler User’s Guide (SPRU187) . Optimizing C/C++ Code 3-11...
  • Page 77: Performing Program-Level Optimization (–Pm Option)

    Compiling C/C++ Code 3.2.3 Performing Program-Level Optimization (–pm Option) You can specify program-level optimization by using the –pm option with the –o3 option. With program-level optimization, all your source files are compiled into one intermediate file giving the compiler complete program view during compilation.
  • Page 78: Profiling Your Code

    Profiling Your Code 3.3 Profiling Your Code In large applications, it makes sense to optimize the most important sections of code first. You can use the information generated by profiling options to get started. You can use several different methods to profile your code. These methods are described below.
  • Page 79: Including The Clock( ) Function

    Profiling Your Code Count represents the number of times each function was called and entered. Inclusive represents the total cycle time spent inside that function, including calls to other functions. Incl–Max (Inclusive Max) represents the longest time spent inside that function during one call. Exclusive and Excl–Max are the same as Inclusive and Incl–Max except that time spent in calls to other functions inside that function has been removed.
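The clock()-based timing that this section's title refers to can be sketched portably: read the clock twice back to back to measure the overhead of the measurement itself, then subtract that overhead from the timed region. Function and variable names here are illustrative, not from the manual.

```c
#include <time.h>

static volatile long sink;                 /* defeats optimization   */
static void workload(void)                 /* stand-in code to time  */
{
    for (long i = 0; i < 1000000; i++)
        sink += i;
}

/* Time one call to fn in clock ticks, net of clock() overhead. */
static long ticks_for(void (*fn)(void))
{
    clock_t t0 = clock(), t1 = clock();
    clock_t overhead = t1 - t0;            /* cost of measuring      */
    t0 = clock();
    fn();
    t1 = clock();
    return (long)(t1 - t0 - overhead);
}
```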
  • Page 80: Refining C/C++ Code

    Refining C/C++ Code 3.4 Refining C/C++ Code You can realize substantial gains from the performance of your C/C++ code by refining your code in the following areas: Using intrinsics to replace complicated C/C++ code Using word access to operate on 16-bit data stored in the high and low parts of a 32-bit register Using double access to operate on 32-bit data stored in a 64-bit register pair (’C64x and ’C67x only)
  • Page 81: Tms320C6000 C/C++ Compiler Intrinsics

    Refining C/C++ Code Table 3–6 lists the ’C6000 intrinsics. For more information on using intrinsics, see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide . Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics Assembly C Compiler Intrinsic Description Device Instruction int _abs(int src2 );...
  • Page 82 Refining C/C++ Code Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics (Continued) Assembly C Compiler Intrinsic Description Device Instruction int _cmpeq4 (int src1 , int src2 ); CMPEQ4 Performs equality comparisons on each ’C64x pair of 8–bit values. Equality results are packed into the four least–significant bits of the return value.
  • Page 83 Refining C/C++ Code Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics (Continued) Assembly C Compiler Intrinsic Description Device Instruction int_dpint(double); DPINT Converts 64-bit double to 32-bit signed in- ’C67x teger, using the rounding mode set by the CSR register. int _ext(int src2, uint csta , int cstb );...
  • Page 84 Refining C/C++ Code Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics (Continued) Assembly C Compiler Intrinsic Description Device Instruction float _itof(uint); Reinterprets the bits in the unsigned inte- ger as a float. (Ex: _itof(0x3f800000) == 1.0) double & _memd8(void * ptr); LDNDW/ Allows unaligned loads and stores of 8 by- ’C64x...
  • Page 85 Refining C/C++ Code Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics (Continued) Assembly C Compiler Intrinsic Description Device Instruction int _mpy(int src1, int src2 ); Multiplies the 16 LSBs of src1 by the 16 int _mpyus(uint src1, int src2 ); MPYUS LSBs of src2 and returns the result. Values int _mpysu(int src1, uint src2 );...
  • Page 86 Refining C/C++ Code Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics (Continued) Assembly C Compiler Intrinsic Description Device Instruction double _rcpdp(double); RCPDP Computes the approximate 64-bit double ’C67x reciprocal. float _rcpsp(float); RCPSP Computes the approximate 64-bit double ’C67x reciprocal. unsigned _rotl (uint src2 , uint src1 );...
  • Page 87 Refining C/C++ Code Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics (Continued) Assembly C Compiler Intrinsic Description Device Instruction unsigned _shlmb (uint src1 , uint src2 ); SHLMB Shifts src2 left/right by one byte, and the ’C64x unsigned _shrmb (uint src1 , uint src2 );...
  • Page 88 Refining C/C++ Code Table 3–6. TMS320C6000 C/C++ Compiler Intrinsics (Continued) Assembly C Compiler Intrinsic Description Device Instruction int _sub2(int src1 , int src2 ); SUB2 Subtracts the upper and lower halves of src2 from the upper and lower halves of src1, and returns the result.
  • Page 89: Using Word Access For Short Data

    Refining C/C++ Code 3.4.2 Using Word Access for Short Data The ’C6000 has instructions with corresponding intrinsics, such as _add2( ), _mpyhl( ), _mpylh( ), that operate on 16-bit data stored in the high and low parts of a 32-bit register. When operating on a stream of short data, you can use word (int) accesses to read two short values at a time, and then use ’C6x intrinsics to operate on the data.
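The packed-add idea is easy to emulate on a host, which also makes the semantics concrete: _add2() adds the upper and lower 16-bit halves of two words independently, with no carry crossing the halfway point. The sketch below is an emulation of those assumed semantics, not TI's implementation.

```c
#include <stdint.h>

/* Host emulation of the _add2() intrinsic: two independent 16-bit
 * adds packed in one 32-bit word; carries do not cross halves. */
static int32_t add2_emu(int32_t a, int32_t b)
{
    uint32_t lo = ((uint32_t)a + (uint32_t)b) & 0xFFFFu;
    uint32_t hi = (((uint32_t)a >> 16) + ((uint32_t)b >> 16)) & 0xFFFFu;
    return (int32_t)((hi << 16) | lo);
}

/* Word-at-a-time vector sum of shorts, as in the text: each int
 * access reads two shorts, and each add2 performs two additions. */
static void vecsum2(const int32_t *in1, const int32_t *in2,
                    int32_t *sum, int nwords)
{
    for (int i = 0; i < nwords; i++)
        sum[i] = add2_emu(in1[i], in2[i]);
}
```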
  • Page 90: Vector Sum With Non–Aligned Word Accesses To Memory

    Refining C/C++ Code Example 3–9. Vector Sum With Non–aligned Word Accesses to Memory void vecsum4a(short *restrict sum, const short *restrict in1, const short *restrict in2, unsigned int N) int i; #pragma MUST_ITERATE (10) for (i = 0; i < N; i += 2) _mem4((void *)&sum[i]) = _add2(_mem4((void *)&in1[i]), _mem4((void *)&in2[i]));...
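_mem4() is how the ’C64x expresses an unaligned 32-bit load or store in C. A portable stand-in uses memcpy, which most compilers lower to a single unaligned access on targets that permit it. The mem4_load/mem4_store names below are illustrative, not TI's.

```c
#include <stdint.h>
#include <string.h>

/* Portable emulation of the _mem4() unaligned-access idea:
 * a 4-byte memcpy is well-defined at any alignment. */
static uint32_t mem4_load(const void *p)
{
    uint32_t w;
    memcpy(&w, p, 4);
    return w;
}

static void mem4_store(void *p, uint32_t w)
{
    memcpy(p, &w, 4);
}

/* Round-trip a word through a deliberately misaligned address. */
static int mem4_selftest(void)
{
    unsigned char buf[8] = {0};
    mem4_store(buf + 1, 0xDEADBEEFu);   /* odd address on purpose */
    return mem4_load(buf + 1) == 0xDEADBEEFu;
}
```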
  • Page 91: Vector Sum With Restrict Keywords, Must_Iterate Pragma And Word Reads (Generic Version)

    Refining C/C++ Code If a vecsum( ) function is needed to handle short-aligned data and odd-numbered loop counters, then you must add code within the function to check for these cases. Knowing what type of data is passed to a function can improve performance considerably.
  • Page 92: Dot Product Using Intrinsics

    Refining C/C++ Code 3.4.2.1 Using Word Access in Dot Product Other intrinsics that are useful for reading short data as words are the multiply intrinsics. Example 3–11 is a dot product example that reads word-aligned short data and uses the _mpy( ) and _mpyh( ) intrinsics. The _mpyh( ) intrinsic uses the ’C6000 instruction MPYH, which multiplies the high 16 bits of two registers, giving a 32-bit result.
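The two multiply intrinsics can likewise be emulated to show exactly what the word-wide dot product computes: _mpy() multiplies the signed low 16-bit halves and _mpyh() the signed high halves, each producing a 32-bit result. This is an emulation sketch of the semantics stated in Table 3–6, not TI's code.

```c
#include <stdint.h>

/* Host emulation of _mpy() (low 16 x low 16, signed) and
 * _mpyh() (high 16 x high 16, signed). */
static int32_t mpy_emu(int32_t a, int32_t b)
{
    return (int32_t)(int16_t)a * (int16_t)b;
}

static int32_t mpyh_emu(int32_t a, int32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
}

/* Dot product over word-aligned shorts: one word load supplies two
 * 16-bit terms, so each iteration does two multiply-accumulates. */
static int32_t dotp_words(const int32_t *a, const int32_t *b, int nwords)
{
    int32_t sum = 0;
    for (int i = 0; i < nwords; i++)
        sum += mpy_emu(a[i], b[i]) + mpyh_emu(a[i], b[i]);
    return sum;
}
```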
  • Page 93: Fir Filter— Original Form

    Refining C/C++ Code 3.4.2.2 Using Word Access in FIR Filter Example 3–12 shows an FIR filter that can be optimized with word reads of short data and multiply intrinsics. Example 3–12. FIR Filter—Original Form void fir1(const short x[restrict], const short h[restrict], short y[restrict], int n, int m, int s) int i, j;...
  • Page 94: Fir Filter — Optimized Form

    Example 3–13. FIR Filter—Optimized Form

        void fir2(const int x[restrict], const int h[restrict],
                  short y[restrict], int n, int m, int s)
        {
            int i, j;
            long y0, y1;
            long round = 1L << (s - 1);

            #pragma MUST_ITERATE (8);
            for (j = 0; ...
  • Page 95: Basic Float Dot Product

    Refining C/C++ Code 3.4.2.3 Using Double Word Access for Word Data (’C64x and ’C67x Specific) The ’C64x and ’C67x families have a load double word (LDDW) instruction, which can read 64 bits of data into a register pair. Just like using word accesses to read 2 short data items, double word accesses can be used to read 2 word data items (or 4 short data items).
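    The effect of one LDDW can be mimicked on a host with a single 64-bit read split back into words. hi32, lo32, and load_doubleword are hypothetical helpers standing in for _hi( ), _lo( ), and the load the intrinsics generate.

```c
#include <stdint.h>
#include <string.h>

/* One 64-bit access in place of two 32-bit loads, like LDDW filling a
   register pair. memcpy keeps the access legal regardless of alignment. */
static uint64_t load_doubleword(const uint32_t *p)
{
    uint64_t d;
    memcpy(&d, p, sizeof d);
    return d;
}

static uint32_t hi32(uint64_t d) { return (uint32_t)(d >> 32); }  /* like _hi() */
static uint32_t lo32(uint64_t d) { return (uint32_t)d; }          /* like _lo() */
```

    Which word lands in the high half depends on the machine's endianness, which is why portable code should not assume an ordering.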
  • Page 96: Float Dot Product Using Intrinsics

    Example 3–15. Float Dot Product Using Intrinsics

        float dotprod2(const double a[restrict], const double b[restrict])
        {
            int i;
            float sum0 = 0;
            float sum1 = 0;

            for (i = 0; i < 512/2; i++)
            {
                sum0 += _itof(_hi(a[i])) * _itof(_hi(b[i]));
                sum1 += _itof(_lo(a[i])) * _itof(_lo(b[i]));
            }
            return sum0 + sum1;
        }
  • Page 97: Float Dot Product With Peak Performance

    Example 3–16. Float Dot Product With Peak Performance

        #define FHI(a) _itof(_hi(a))
        #define FLO(a) _itof(_lo(a))

        float dotp3(const double a[restrict], const double b[restrict])
        {
            int i;
            float sum0 = 0;
            float sum1 = 0;
            float sum2 = 0;
            float sum3 = 0;
            float sum4 = 0;
            ...
  • Page 98: Int Dot Product With Nonaligned Doubleword Reads

    In Example 3–17, the dot product example has been rewritten for the ’C64x. This demonstrates how it is possible to perform doubleword nonaligned memory reads on a dot product that always executes a multiple of 4 times.

    Example 3–17. Int Dot Product with Nonaligned Doubleword Reads

        int dotp4(const short *restrict a, const short *restrict b,
                  unsigned int N)
        {
            int i, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
            ...
  • Page 99: Using The Compiler To Generate A Dot Product With Word Accesses

    3.4.2.4 Using _nassert(), Word Accesses, and the MUST_ITERATE pragma

    It is possible for the compiler to automatically perform packed data optimizations for some, but not all, loops. By either using global arrays, or by using the _nassert() intrinsic to provide alignment information about your pointers, the compiler can transform your code to use word accesses and the ’C6000 intrinsics.
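    The pattern reads roughly as below. The alignment masks and trip-count bounds are illustrative; _nassert( ) is stubbed out so the sketch also compiles on a host, where the TI-specific pragma is simply ignored.

```c
/* On the TI compiler _nassert() is a built-in that generates no code but
   passes its predicate to the optimizer; stub it out for host builds.
   _TMS320C6X is the TI compiler's predefined macro. */
#ifndef _TMS320C6X
#define _nassert(e) ((void)0)
#endif

void vecsum_hint(short *restrict sum, const short *restrict in1,
                 const short *restrict in2, unsigned int N)
{
    _nassert(((int)sum & 0x3) == 0);   /* pointers are word aligned */
    _nassert(((int)in1 & 0x3) == 0);
    _nassert(((int)in2 & 0x3) == 0);

    #pragma MUST_ITERATE (4, 40, 2)    /* at least 4, at most 40, multiple of 2 */
    for (unsigned int i = 0; i < N; i++)
        sum[i] = (short)(in1[i] + in2[i]);
}
```

    With these hints the optimizer knows two shorts can always be read as one word and the loop can be unrolled without a remainder check.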
  • Page 100: Using The _Nassert() Intrinsic To Generate Word Accesses For Vector Sum

    40 times (the first argument), and a maximum of 40 times (the second argument). An optional third argument tells the compiler what the trip count is a multiple of. See the TMS320C6000 C/C++ Compiler User’s Guide for more information about the MUST_ITERATE pragma.
  • Page 101: Using _Nassert() Intrinsic To Generate Word Accesses For Fir Filter

    Example 3–20. Using _nassert() Intrinsic to Generate Word Accesses for FIR Filter

        void fir (const short x[restrict], const short h[restrict],
                  short y[restrict], int n, int m, int s)
        {
            int i, j;
            long y0;
            long round = 1L << (s - 1);

            _nassert(((int)x & ...
  • Page 102: Compiler Output From Example 3–13

    Refining C/C++ Code Example 3–22. Compiler Output From Example 3–13 ; PIPED LOOP KERNEL B3,B5:B4,B5:B4 A3,A5:A4,A5:A4 B1,B2 .M2X B1,A8,B3 MPYHL .M1X B1,A8,A3 || [ A1] || [ B0] .D2T2 *B8,B1 [ B0] B0,1,B0 A3,A7:A6,A7:A6 B3,B7:B6,B7:B6 MPYH .M1X B2,A8,A3 MPYHL .M2X A8,B9,B3 || [ A1] A1,1,A1...
  • Page 103: Automatic Use Of Word Accesses Without The _Nassert Intrinsic

    If your code operates on global arrays as in Example 3–24, and you build your application with the -pm and -o3 options, the compiler will have enough information (trip counts and alignments of variables) to determine whether or not packed-data processing optimization is feasible.
  • Page 104 Refining C/C++ Code Below is the resulting assembly file (file1.asm). Notice that the dot product loop uses word accesses and the ‘C6000 intrinsics. ; PIPED LOOP KERNEL [!A1] B6,B7,B7 || [!A1] A6,A0,A0 .M2X B5,A4,B6 MPYH .M1X B5,A4,A6 || [ B0] .D1T1 *+A5(4),A4 .D2T2...
  • Page 105: Software-Pipelined Loop

    Refining C/C++ Code 3.4.3 Software Pipelining Software pipelining is a technique used to schedule instructions from a loop so that multiple iterations of the loop execute in parallel. When you use the –o2 and –o3 compiler options, the compiler attempts to software pipeline your code with information that it gathers from your program.
  • Page 106: Trip Counters

    Alternatively, the user can provide this information using the MUST_ITERATE and PROB_ITERATE pragmas. For more information about pragmas, see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide (SPRU187). The minimum safe trip count is the number of iterations of the loop that are necessary to safely execute the software-pipelined version of the loop.
  • Page 107 The user can increase the compiler’s ability to perform this optimization by us- ing the -mh, or -mhn option whenever possible. See the TMS320C6000 Opti- mizing C/C++ Compiler User’s Guide for more information about options.
  • Page 108 Refining C/C++ Code 3.4.3.3 Communicating Trip-Count Information to the Compiler When invoking the compiler, use the following options to communicate trip- count information to the compiler: Use the –o3 and –pm compiler options to allow the optimizer to access the whole program or large parts of it and to characterize the behavior of loop trip counts.
  • Page 109: Vector Sum With Three Memory Operations

        _nassert(((int) c & 0x7) == 0);  /* c is double word aligned */
        . . .

    See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for a complete discussion of the –ms, –o3, and –pm options, the _nassert intrinsic, and the MUST_ITERATE and PROB_ITERATE pragmas.
  • Page 110: Word-Aligned Vector Sum

    The performance of a software pipeline is limited by the number of resources that can execute in parallel. In its word-aligned form (Example 3–27), the vector sum loop delivers two results every two cycles because the two loads and the store are all operating on two 16-bit values at a time.
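    The unroll-by-two transformation this discussion relies on looks like the following in plain C (names illustrative). Each trip now produces two results, which is what lets both datapaths stay busy.

```c
/* Vector sum unrolled by 2: two loads, two adds, and two stores per trip.
   N is assumed even, as in the manual's word-aligned examples. */
static void vecsum_unrolled(short *sum, const short *in1,
                            const short *in2, int N)
{
    for (int i = 0; i < N; i += 2) {
        sum[i]     = (short)(in1[i]     + in2[i]);
        sum[i + 1] = (short)(in1[i + 1] + in2[i + 1]);
    }
}
```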
  • Page 111: Fir_Type2—Original Form

    In general, unrolling may be a good idea if you have an uneven partition or if your loop carried dependency bound is greater than the partition bound. (Refer to section 6.7, Loop Carry Paths, and section 3.2 in the TMS320C6000 Optimizing C/C++ Compiler User’s Guide.) This information can be obtained by using the –mw option and looking at the comment block before the loop.
  • Page 112: Fir_Type2—Inner Loop Completely Unrolled

    Example 3–30. FIR_Type2—Inner Loop Completely Unrolled

        void fir2_u(const short input[restrict], const short coefs[restrict],
                    short out[restrict])
        {
            int i, j;
            int sum;

            for (i = 0; i < 40; i++)
            {
                sum  = coefs[0] * input[i + 15];
                sum += coefs[1] * input[i + 14];
                sum += coefs[2] * input[i + 13];
                ...
  • Page 113: Vector Sum

    Out of Interruptible Code . If the compiler does not automatically unroll the loop, you can suggest that the compiler unroll the loop by using the UNROLL pragma. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more informa- tion. 3-48...
  • Page 114 Thus, the user assumes responsibility for safety. For a complete discussion of the -mh option, including how to use it safely, see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide . 3.4.3.6 What Disqualifies a Loop from Being Software-Pipelined In a sequence of nested loops, the innermost loop is the only one that can be software-pipelined.
  • Page 115: Use Of If Statements In Float Collision Detection (Original Code)

    Refining C/C++ Code In the loop in Example 3–32, there is an early exit. If dist0 or dist1 is less than distance, then execution breaks out of the loop early. If the compiler could not perform transformations to the loop to software pipeline the loop, you would have to modify the code.
  • Page 116: Use Of If Statements In Float Collision Detection (Modified Code)

    Example 3–33. Use of If Statements in Float Collision Detection (Modified Code)

        int colldet_new(const float *restrict x, const float *restrict p,
                        float point, float distance)
        {
            int I, retval = 0;
            float sum0, sum1, dist0, dist1;

            for (I = 0; I < (28 * 3); I += 6)
            {
                sum0 = x[I+0]*p[0] + x[I+1]*p[1] + x[I+2]*p[2];
                ...
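    The essence of the transformation in Example 3–33 (running every iteration and latching the answer instead of breaking out) can be shown with a simplified, portable sketch; the data layout and names here are made up for illustration, not taken from the manual.

```c
/* Run every iteration and remember the first hit instead of breaking out;
   a fixed trip count with no early exit is what makes the loop eligible
   for software pipelining. */
static int first_below(const float *d, int n, float distance)
{
    int retval = 0;
    for (int i = 0; i < n; i++)
        if (d[i] < distance && retval == 0)
            retval = i + 1;   /* latch the result, but keep looping */
    return retval;
}
```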
  • Page 117: Linking Issues

    Chapter 4 Linking Issues

    This chapter contains useful information about other problems and questions that might arise while building your projects, including:

        What to do with the relocation value truncated linker and assembler messages
        How to save on-chip memory by moving the RTS off-chip
        How to build your application with RTS calls either near or far
        How to change the default RTS data from far to near
  • Page 118: How To Use Linker Error Messages

    Edit the resulting .lst file, in this case file.lst. Each line in the assembly listing has several fields. For a full description of those fields see section 3.10 of the TMS320C6000 Assembly Language Tools User’s Guide . The field you are interested in here is the second one, the section program counter (SPC) field.
  • Page 119 –ml n memory model option to automatically declare ary and other such data objects to be far. See chapter 2 of the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on –ml n .
  • Page 120: Referencing Far Global Objects Defined In Other Files

    How to Use Linker Error Messages

    Example 4–1. Referencing Far Global Objects Defined in Other Files

        <file1.c>
        /* Define ary to be a global variable not accessible via the data page */
        /* pointer.                                                            */
        far int ary;
        ...

        <file2.c>
        /* In order for the code in file2.c to access ary correctly, it must be */
        /* defined as ’extern far’.                                             */
        ...
  • Page 121: Executable Flag

    How to Use Linker Error Messages 4.1.2 Executable Flag You may also see the linker message: >> warning: output file file.out not executable If this is due solely to MVK instructions, paired with MVKH, which have yet to be changed to MVKL, then this warning may safely be ignored. The loaders supplied by TI will still load and execute this .out file.
  • Page 122: Command Line Options For Rts Calls

    How to Save On-Chip Memory by Placing RTS Off-Chip 4.2 How to Save On-Chip Memory by Placing RTS Off-Chip One of many techniques you might use to save valuable on-chip space is to place the code and data needed by the runtime-support (RTS) functions in off-chip memory.
  • Page 123: Must #Include Header Files

    to RTS functions to be near, regardless of the setting of the –ml n switch. This option is for special situations, and typically isn’t needed. The option –mr1 will cause calls to RTS functions to be far, regardless of the setting of the –ml n switch.
  • Page 124: How To Link

    4.2.4 How to Link

    You place the RTS code and data in off-chip memory through the linking process. Here is an example linker command file you could use instead of the lnk.cmd file provided in the lib directory.
  • Page 125 How to Save On-Chip Memory by Placing RTS Off-Chip /*–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––*/ /* RTS code – placed off chip /*–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––*/ .rtstext { –lrts6200.lib(.text) } > EXT0 /*–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––*/ /* RTS data – undefined sections – placed off chip /*–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––*/ .rtsbss { –lrts6200.lib(.bss) –lrts6200.lib(.far) } >...
  • Page 126: Example Compiler Invocation

    The .rtsdata section combines all of the defined data sections together. Defined data sections both reserve and initialize the contents of a section. You use the .sect assembler directive to create defined sections. It is necessary to build and allocate the undefined data sections separately from the defined data sections.
  • Page 127: How _Far_Rts Is Defined In Linkage.h With -Mr

    How to Save On-Chip Memory by Placing RTS Off-Chip Refer to section 4.4.1 to learn about the linker error messages when calls go beyond the PC relative boundary. 4.2.6 Header File Details Look at the file linkage.h in the include directory of the release. Depending on the value of the _FAR_RTS macro, the macro _CODE_ACCESS is set to force calls to RTS functions to be either user default, near, or far.
  • Page 128 How to Save On-Chip Memory by Placing RTS Off-Chip if you want RTS data access to use the same method used when accessing ordinary user data. Copy linkage.h to the lib directory. Go to the lib directory. Replace the linkage.h entry in the source library: ar6x –r rts.src linkage.h Delete linkage.h.
  • Page 129: Structure Of Assembly Code

    Chapter 5 Structure of Assembly Code

    An assembly language program must be an ASCII text file. Any line of assembly code can include up to seven items:

        Label
        Parallel bars
        Conditions
        Instruction
        Functional unit
        Operands
        Comment
  • Page 130: Labels In Assembly Code

    Labels / Parallel Bars 5.1 Labels A label identifies a line of code or a variable and represents a memory address that contains either an instruction or data. Figure 5–1 shows the position of the label in a line of assembly code. The colon following the label is optional.
  • Page 131: Conditions In Assembly Code

    Conditions 5.3 Conditions Five registers on the ’C62x/’C67x are available for conditions: A1, A2, B0, B1, and B2. Six registers on the ’C64x are available for conditions: A0, A1, A2, B0, B1, and B2. Figure 5–3 shows the position of a condition in a line of assembly code.
  • Page 132: Instructions In Assembly Code

    .short value   Reserve 16 bits in memory and fill with specified value
    .half value    Reserve 16 bits in memory and fill with specified value
    .byte value    Reserve 8 bits in memory and fill with specified value

    See the TMS320C6000 Assembly Language Tools User’s Guide for a complete list of directives.
  • Page 133: Tms320C6X Functional Units

    Functional Units 5.5 Functional Units The ’C6000 CPU contains eight functional units, which are shown in Figure 5–5 and described in Table 5–2. Figure 5–5. TMS320C6x Functional Units Register Register file A file B Memory Structure of Assembly Code...
  • Page 134: Functional Units And Operations Performed

    Table 5–2. Functional Units and Operations Performed

    .L unit (.L1, .L2)
        Fixed-point: 32/40-bit arithmetic and compare operations; 32-bit logical operations; leftmost 1 or 0 counting for 32 bits; normalization count for 32 and 40 bits; byte shifts; data packing/unpacking
        Floating-point: arithmetic operations; DP → SP, INT → DP, INT → SP conversion operations
  • Page 135: Units In The Assembly Code

    Table 5–2. Functional Units and Operations Performed (Continued)

    .M unit (.M1, .M2)
        Fixed-point: 16 x 16 multiply operations; 16 x 32 multiply operations; quad 8 x 8 multiply operations; dual 16 x 16 multiply operations; dual 16 x 16 multiply with add/subtract operations
        Floating-point: 32 x 32-bit fixed-point multiply operations; floating-point multiply operations
  • Page 136: Operands In The Assembly Code

    When an operand comes from the other register file, the unit includes an X, as shown in Figure 5–8, indicating that the instruction is using one of the cross paths. (See the TMS320C6000 CPU and Instruction Set Reference Guide for more information on cross paths.) Figure 5–8.
  • Page 137: Comments

    Comments 5.7 Comments As with all programming languages, comments provide code documentation. Figure 5–9 shows the position of the comment in a line of assembly code. Figure 5–9. Comments in Assembly Code label: parallel bars [condition] instruction unit operands ; comments The following are guidelines for using comments in assembly code: A comment may begin in any column when preceded by a semicolon (;).
  • Page 138 Chapter 6 Optimizing Assembly Code via Linear Assembly This chapter describes methods that help you develop more efficient assembly language programs, understand the code produced by the assembly optimizer, and perform manual optimization. This chapter encompasses phase 3 of the code development flow. After you have developed and optimized your C code using the ’C6000 compiler, extract the inefficient areas from your C code and rewrite them in linear assembly (as- sembly code that has not been register-allocated and is unscheduled).
  • Page 139: Assembly Code

    Although you have the option with the ’C6000 to specify the functional unit or register used, this may restrict the compiler’s ability to fully optimize your code. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information. This chapter takes you through the optimization process manually to show you how the assembly optimizer works and to help you understand when you might want to perform some of the optimizations manually.
  • Page 140 Assembly Code Each example discusses the: Algorithm in C code Translation of the C code to linear assembly Dependency graph to describe the flow of data in the algorithm Allocation of resources (functional units, registers, and cross paths) in lin- ear assembly Note: There are three types of code for the ’C6000: C/C++ code (which is input for...
  • Page 141: Assembly Optimizer Options And Directives

    Assembly Optimizer Options and Directives 6.2 Assembly Optimizer Options and Directives All directives and options that are described in the following sections are listed in greater detail in Chapter 4 of the TMS320C6000 Optimizing C/C++ Compiler User’s Guide. 6.2.1 The –o n Option
  • Page 142: The .Mdep Directive

    Assembly Optimizer Options and Directives For a full description on the implications of .no_mdep and the -mt option, refer to Appendix B, Memory Alias Disambiguation. Refer to the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on both the -mt option and the .no_mdep directive.
  • Page 143 Assembly Optimizer Options and Directives Example 6–3.Linear Assembly Dot Product (Continued) *ptr_a++, val1 *ptr_b++, val2 val1, val2, prod1 sum1, prod1, sum1 *ptr_a++, val1 *ptr_b++, val2 val3, val4, prod2 sum2, prod2, sum2 [cnt] add –1, cnt, cnt [cnt] b loop sum1, sum2, sum1 return sum1 .endproc <loop kernel generated>...
  • Page 144: Linear Assembly Dot Product With .Mptr

    Assembly Optimizer Options and Directives Example 6–4. Linear Assembly Dot Product With .mptr dotp: .cproc ptr_a, ptr_b, cnt .reg val1, val2, val3, val4 .reg prod1, prod2, sum1, sum2 zero sum1 zero sum2 .mptr ptr_a, x, 4 .mptr ptr_b, x, 4 loop: .trip 20, 20 *ptr_a++, val1 *ptr_b++, val2...
  • Page 145: The .Trip Directive

    The above loop kernel has no memory bank conflicts in the case where ptr_a and ptr_b point to the same bank. This means that you have to know how your data is aligned in C code before using the .mptr directive in your linear assembly code.
  • Page 146: Writing Parallel Code

    Writing Parallel Code 6.3 Writing Parallel Code One way to optimize linear assembly code is to reduce the number of execu- tion cycles in a loop. You can do this by rewriting linear assembly instructions so that the final assembly instructions execute in parallel. 6.3.1 Dot Product C Code The dot product is a sum in which each element in array a is multiplied by the...
  • Page 147: Translating C Code To Linear Assembly

    6.3.2 Translating C Code to Linear Assembly

    The first step in optimizing your code is to translate the C code to linear assembly.

    6.3.2.1 Fixed-Point Dot Product

    Example 6–7 shows the linear assembly instructions used for the inner loop of the fixed-point dot product C code.
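    For reference, the scalar C inner loop being translated is just a multiply-accumulate; the array length here is illustrative.

```c
/* Plain C dot product: the loop body that the linear assembly inner loop
   implements one iteration of (load, load, multiply, accumulate). */
static int dotp(const short *a, const short *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```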
  • Page 148: Linear Assembly Resource Allocation

    Writing Parallel Code accumulates the total of the results from the multiply (MPYSP) instruction. The subtract (SUB) instruction decrements the loop counter. An additional instruction is included to execute the branch back to the top of the loop. The branch (B) instruction is conditional on the loop counter, A1, and executes only until A1 is 0.
  • Page 149: Dependency Graph Of Fixed-Point Dot Product

    Use the following steps to draw a dependency graph:

    1) Define the nodes based on the variables accessed by the instructions.
    2) Define the data paths that show the flow of data between nodes.
    3) Add the instructions and the latencies.
    4) Add the functional units.
  • Page 150: Dependency Graph Of Floating-Point Dot Product

    Writing Parallel Code The dependency graph for this dot product algorithm has two separate parts because the decrement of the loop counter and the branch do not read or write any variables from the other part. The SUB instruction writes to the loop counter, cntr. The output of the SUB instruction feeds back and creates a loop carry path.
  • Page 151: Nonparallel Versus Parallel Assembly Code

    Writing Parallel Code The dependency graph for this dot product algorithm has two separate parts because the decrement of the loop counter and the branch do not read or write any variables from the other part. The SUB instruction writes to the loop counter, cntr. The output of the SUB instruction feeds back and creates a loop carry path.
  • Page 152: Dependency Graph Of Fixed-Point Dot Product With Parallel Assembly

    Writing Parallel Code Figure 6–3. Dependency Graph of Fixed-Point Dot Product with Parallel Assembly .M1X LOOP Example 6–10. Parallel Assembly Code for Fixed-Point Dot Product 100, A1 ; set up loop counter ZERO ; zero out accumulator LOOP: *A4++,A2 ; load ai from memory *B4++,B2 ;...
  • Page 153: Nonparallel Assembly Code For Floating-Point Dot Product

    Writing Parallel Code Rearranging the order of the instructions also improves the performance of the code. The SUB instruction can take the place of one of the NOP delay slots for the LDH instructions. Moving the B instruction after the SUB removes the need for the NOP 5 used at the end of the code in Example 6–9.
  • Page 154: Dependency Graph Of Floating-Point Dot Product With Parallel Assembly

    Writing Parallel Code Figure 6–4. Dependency Graph of Floating-Point Dot Product with Parallel Assembly MPYSP .M1X ADDSP LOOP Example 6–12. Parallel Assembly Code for Floating-Point Dot Product 100, A1 ; set up loop counter ZERO ; zero out accumulator LOOP: *A4++,A2 ;...
  • Page 155: Comparison Of Nonparallel And Parallel Assembly Code For Fixed-Point Dot Product

    Writing Parallel Code Rearranging the order of the instructions also improves the performance of the code. The SUB instruction replaces one of the NOP delay slots for the LDW instructions. Moving the B instruction after the SUB removes the need for the NOP 5 used at the end of the code in Example 6–11 on page 6-16.
  • Page 156: Using Word Access For Short Data And Doubleword Access For Floating-Point Data

    + 1] at the same time and load both into a register pair. (The data must be doubleword-aligned in memory.) See the TMS320C6000 CPU and Instruction Set Reference Guide for more specific information on the LDDW instruction.
  • Page 157: Translating C Code To Linear Assembly

    Using symbolic names for data and pointers makes code easier to write and allows the optimizer to allocate registers. However, you must use the .reg assembly optimizer directive. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on writing linear assembly code.
  • Page 158: Linear Assembly For Floating-Point Dot Product Inner Loop With Lddw

    Using symbolic names for data and pointers makes code easier to write and allows the optimizer to allocate registers. However, you must use the .reg assembly optimizer directive. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on writing linear assembly code.
  • Page 159: Drawing A Dependency Graph

    Using Word Access for Short Data and Doubleword Access for Floating-Point Data 6.4.3 Drawing a Dependency Graph The dependency graph in Figure 6–5 for the fixed-point dot product shows that the LDW instructions are parents of the MPY instructions and the MPY instructions are parents of the ADD instructions.
  • Page 160: Linear Assembly Resource Allocation

    LDDWs, MPYSPs, and ADDSPs on each side. To keep both sides even, place the remaining two instructions, B and SUB, on opposite sides.

    Figure 6–6. Dependency Graph of Floating-Point Dot Product With LDDW
    [figure: dependency graph split into an A side and a B side, with an LDDW feeding each side]
  • Page 161: Dependency Graph Of Fixed-Point Dot Product With Ldw (Showing Functional Units)

    Figure 6–7. Dependency Graph of Fixed-Point Dot Product With LDW (Showing Functional Units)
    [figure: A side and B side; nodes ai & ai+1, bi & bi+1, pi, pi+1 (MPY/MPYH on .M1X/.M2X), sum0, sum1, and cntr with the branch to LOOP]
  • Page 162: Dependency Graph Of Floating-Point Dot Product With Lddw (Showing Functional Units)

    Figure 6–8. Dependency Graph of Floating-Point Dot Product With LDDW (Showing Functional Units)
    [figure: A side and B side; LDDW loads ai & ai+1 and bi & bi+1, MPYSP on .M1X/.M2X produces pi and pi+1, and ADDSP accumulates sum0 and sum1]
  • Page 163: (Before Software Pipelining)

    6.4.5 Final Assembly

    Example 6–19 shows the final assembly code for the unrolled loop of the fixed-point dot product and Example 6–20 shows the final assembly code for the unrolled loop of the floating-point dot product.
  • Page 164: (Before Software Pipelining)

    Using Word Access for Short Data and Doubleword Access for Floating-Point Data 6.4.5.2 Floating-Point Dot Product Example 6–20 uses LDDW instructions instead of LDW instructions. Example 6–20. Assembly Code for Floating-Point Dot Product With LDDW (Before Software Pipelining) 50,A1 ; set up loop counter ZERO ;...
  • Page 165: Comparing Performance

    6.4.6 Comparing Performance

    Executing the fixed-point dot product with the optimizations in Example 6–19 requires only 50 iterations, because you operate in parallel on both the even and odd array elements. With the setup code and the final ADD instruction, 100 iterations of this loop require a total of 402 cycles (1 + 8 × 50 + 1).
  • Page 166: Software Pipelining

    Software Pipelining 6.5 Software Pipelining This section describes the process for improving the performance of the assembly code in the previous section through software pipelining. Software pipelining is a technique used to schedule instructions from a loop so that multiple iterations execute in parallel. The parallel resources on the ’C6x make it possible to initiate a new loop iteration before previous iterations finish.
  • Page 167: Dependency Graph Of Fixed-Point Dot Product With Ldw (Showing Functional Units)

    Figure 6–9. Dependency Graph of Fixed-Point Dot Product With LDW (Showing Functional Units)
    [figure: A side and B side; nodes ai & ai+1, bi & bi+1, pi, pi+1 (.M1X/.M2X), sum0, sum1, and cntr with the branch to LOOP]

    Example 6–21. Linear Assembly for Fixed-Point Dot Product Inner Loop (With Conditional SUB Instruction)

        *A4++,A2 ; ...
  • Page 168: Dependency Graph Of Floating-Point Dot Product With Lddw (Showing Functional Units)

    Figure 6–10. Dependency Graph of Floating-Point Dot Product With LDDW (Showing Functional Units)
    [figure: A side and B side; LDDW loads, MPYSP on .M1X/.M2X, ADDSP accumulating sum0 and sum1, and cntr with the branch to LOOP]

    Example 6–22. Linear Assembly for Floating-Point Dot Product Inner Loop (With Conditional SUB Instruction)

        LDDW *A4++,A2
  • Page 169: Modulo Iteration Interval Scheduling

    Software Pipelining 6.5.1 Modulo Iteration Interval Scheduling Another way to represent the performance of the code is by looking at it in a modulo iteration interval scheduling table. This table shows how a software-pipelined loop executes and tracks the available resources on a cycle-by-cycle basis to ensure that no resource is used twice on any given cycle.
  • Page 170: Modulo Iteration Interval Scheduling Table For Floating-Point Dot Product

    6.5.1.2 Floating-Point Example

    The floating-point code in Example 6–20 needs ten cycles for each iteration of the loop, so the iteration interval is ten. Table 6–6 shows a modulo iteration interval scheduling table for the floating-point dot product loop before software pipelining (Example 6–20). Each row represents a functional unit.
  • Page 171 Software Pipelining 6.5.1.3 Determining the Minimum Iteration Interval Software pipelining increases performance by using the resources more effi- ciently. However, to create a fully pipelined schedule, it is helpful to first deter- mine the minimum iteration interval . The minimum iteration interval of a loop is the minimum number of cycles you must wait between each initiation of successive iterations of that loop.
  • Page 172: Modulo Iteration Interval Table For Fixed-Point Dot Product (After Software Pipelining)

    Table 6–7. Modulo Iteration Interval Table for Fixed-Point Dot Product (After Software Pipelining)
    [table: loop prolog cycles followed by the pipelined kernel (cycles 7, 8, 9, ...); one row per functional unit, with asterisks marking overlapping iterations]

    Note: The asterisks indicate the iteration of the loop; ...
  • Page 173: Modulo Iteration Interval Table For Floating-Point Dot Product (After Software Pipelining)

    Floating-Point Example

    Table 6–8 shows a fully pipelined schedule for the floating-point dot product example.

    Table 6–8. Modulo Iteration Interval Table for Floating-Point Dot Product (After Software Pipelining)
    [table: loop prolog cycles followed by the pipelined kernel (cycles 9, 10, 11, ...); one row per functional unit, with asterisks marking overlapping iterations]
  • Page 174: Pseudo-Code For Single-Cycle Accumulator With Addsp

    Note: Since the ADDSP instruction has three delay slots associated with it, the results of adding are staggered by four. That is, the first result from the ADDSP is added to the fifth result, which is then added to the ninth, and so on. The second result is added to the sixth, which is then added to the 10th.
  • Page 175: Software Pipeline Accumulation Staggered Results Due To Three-Cycle Delay

    Table 6–9. Software Pipeline Accumulation Staggered Results Due to Three-Cycle Delay

    Pseudoinstruction         Cycle #     Written expected result (current value of pseudoregister sum)
    ADDSP x(0), sum, sum      ; cycle 4   sum = x(0)
    ADDSP x(1), sum, sum      ; cycle 5   sum = x(1)
    ADDSP x(2), sum, sum      ; ...
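    The staggering in Table 6–9 amounts to keeping several running sums that are only combined after the loop. A portable sketch with four interleaved accumulators (the count of four mirrors the four-cycle ADDSP result spacing; n is assumed a multiple of 4):

```c
/* Four interleaved partial sums: x[0]+x[4]+..., x[1]+x[5]+..., and so on,
   merged at the end - the same dataflow the pipelined ADDSPs produce. */
static float staggered_sum(const float *x, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

    Because floating-point addition is not associative, this reordering can differ from a strictly serial sum in the last bits, which is the usual trade-off of pipelined accumulation.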
  • Page 176: Using The Assembly Optimizer To Create Optimized Loops

    You can use this code as input to the assembly optimizer tool to create software-pipelined loops automatically. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on the assembly optimizer.

    Example 6–24. Linear Assembly for Full Fixed-Point Dot Product

        .global _dotp
  • Page 177: Final Assembly

    Table 6–7 and Table 6–8, respectively.

    Note: All instructions executing in parallel constitute an execute packet. An execute packet can contain up to eight instructions. See the TMS320C6000 CPU and Instruction Set Reference Guide for more information about pipeline operation.
  • Page 178 Software Pipelining 6.5.3.1 Fixed-Point Example Multiple branch instructions are in the pipe. The first branch in the fixed-point dot product is issued on cycle 2 but does not actually branch until the end of cycle 7 (after five delay slots). The branch target is the execute packet defined by the label LOOP.
  • Page 179: Assembly Code For Fixed-Point Dot Product (Software Pipelined)

    Software Pipelining Example 6–26. Assembly Code for Fixed-Point Dot Product (Software Pipelined) *A4++,A2 ; load ai & ai+1 from memory *B4++,B2 ; load bi & bi+1 from memory 50,A1 ; set up loop counter ZERO ; zero out sum0 accumulator ZERO ;...
  • Page 180: Assembly Code For Floating-Point Dot Product (Software Pipelined)

    Software Pipelining 6.5.3.2 Floating-Point Example The first branch in the floating-point dot product is issued on cycle 4 but does not actually branch until the end of cycle 9 (after five delay slots). The branch target is the execute packet defined by the label LOOP. On cycle 9, the first branch returns to the same execute packet, resulting in a single-cycle loop.
  • Page 181 Software Pipelining Example 6–27. Assembly Code for Floating-Point Dot Product (Software Pipelined) (Continued) LDDW A4++,A7:A6 ;******* load ai & ai + 1 from memory LDDW B4++,B7:B6 ;******* load bi & bi + 1 from memory MPYSP .M1X A6,B6,A5 ;** pi = a0 MPYSP .M2X A7,B7,B5...
  • Page 182 Software Pipelining 6.5.3.3 Removing Extraneous Instructions The code in Example 6–26 and Example 6–27 executes extra iterations of some of the instructions in the loop. The following operations occur in parallel on the last cycle of the loop in Example 6–26: iteration 50 of the ADD instructions, iteration 52 of the MPY and MPYH instructions, and iteration 57 of the LDW instructions...
  • Page 183 Software Pipelining ADDSPs all execute exactly 50 times. (The shaded areas of Example 6–29 indicate the changes in this code.) Executing the dot product code in Example 6–29 with no extraneous LDDWs still requires a total of 74 cycles (9 + 41 + 9 + 15), but the code size is now larger. Example 6–28.
  • Page 184 Software Pipelining Example 6–28. Assembly Code for Fixed-Point Dot Product (Software Pipelined With No Extraneous Loads) (Continued) LOOP: A6,A7,A7 ; sum0 += (ai * bi) B6,B7,B7 ; sum1 += (ai+1 * bi+1) .M1X A2,B2,A6 ;** ai * bi MPYH .M2X A2,B2,B6 ;** ai+1 * bi+1 ||[A1] SUB...
  • Page 185 Software Pipelining Example 6–29. Assembly Code for Floating-Point Dot Product (Software Pipelined With No Extraneous Loads) 41,A1 ; set up loop counter ZERO ; sum0 = 0 ZERO ; sum1 = 0 LDDW A4++,A7:A6 ; load ai & ai + 1 from memory LDDW B4++,B7:B6 ;...
  • Page 186 Software Pipelining Example 6–29. Assembly Code for Floating-Point Dot Product (Software Pipelined With No Extraneous Loads) (Continued LOOP: LDDW A4++,A7:A6 ;********* load ai & ai + 1 from memory LDDW B4++,B7:B6 ;********* load bi & bi + 1 from memory MPYSP .M1X A6,B6,A5...
  • Page 187 Software Pipelining Example 6–29. Assembly Code for Floating-Point Dot Product (Software Pipelined With No Extraneous Loads) (Continued) ADDSP .L1X A8,B8,A0 ; sum(0) = sum0(0) + sum1(0) ADDSP .L2X A8,B8,B0 ; sum(1) = sum0(1) + sum1(1) ADDSP .L1X A8,B8,A0 ; sum(2) = sum0(2) + sum1(2) ADDSP .L2X A8,B8,B0...
  • Page 188 Software Pipelining 6.5.3.4 Priming the Loop Although Example 6–28 and Example 6–29 execute as fast as possible, the code size can be smaller without significantly sacrificing performance. To help reduce code size, you can use a technique called priming the loop. Assuming that you can handle extraneous loads, start with Example 6–26 or Example 6–27, which do not have epilogs and, therefore, contain fewer instructions.
  • Page 189 Software Pipelining Example 6–30. Assembly Code for Fixed-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog) 57,A1 ; set up loop counter [A1] SUB A1,1,A1 ; decrement loop counter ZERO ; zero out sum0 accumulator ZERO ; zero out sum1 accumulator [A1] SUB A1,1,A1 ;* decrement loop counter...
  • Page 190 Software Pipelining Floating-Point Example To eliminate the prolog of the floating-point dot product and, therefore, the extra LDDW and MPYSP instructions, begin execution at the loop body (at the LOOP label). Eliminating the prolog means that: Two LDDWs, two MPYSPs, and two ADDSPs occur in the first execution cycle of the loop.
  • Page 191 Software Pipelining Example 6–31. Assembly Code for Floating-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog) (Continued) [A1] LOOP ;*** branch to loop ||[A1] A1,1,A1 ;**** decrement loop counter [A1] LOOP ;**** branch to loop ||[A1] A1,1,A1 ;***** decrement loop counter LOOP: LDDW A4++,A7:A6...
  • Page 192 Software Pipelining 6.5.3.5 Removing Extra SUB Instructions To reduce code size further, you can remove extra SUB instructions. If you know that the loop count is at least 6, you can eliminate the extra SUB instructions as shown in Example 6–32 and Example 6–33. The first five branch instructions are made unconditional, because they always execute.
  • Page 193 Software Pipelining Example 6–33. Assembly Code for Floating-Point Dot Product (Software Pipelined With Smallest Code Size) LOOP ; branch to loop 53,A1 ; set up loop counter LOOP ;* branch to loop ZERO ; zero out mpysp input ZERO ; zero out mpysp input LOOP ;** branch to loop ZERO...
  • Page 194: Comparing Performance

    Software Pipelining 6.5.4 Comparing Performance Table 6–10 compares the performance of all versions of the fixed-point dot product code. Table 6–11 compares the performance of all versions of the floating-point dot product code. Table 6–10. Comparison of Fixed-Point Dot Product Code Examples Code Example 100 Iterations Cycle Count...
  • Page 195: Modulo Scheduling Of Multicycle Loops

    Modulo Scheduling of Multicycle Loops 6.6 Modulo Scheduling of Multicycle Loops Section 6.5 demonstrated the modulo-scheduling technique for the dot product code. In that example of a single-cycle loop, none of the instructions used the same resources. Multicycle loops can present resource conflicts which affect modulo scheduling.
  • Page 196: Determining The Minimum Iteration Interval

    Modulo Scheduling of Multicycle Loops 6.6.3 Determining the Minimum Iteration Interval Example 6–35 includes three memory operations in the inner loop (two LDHs and the STH) that must each use a .D unit. Only two .D units are available on any single cycle;...
  • Page 197: Linear Assembly For Weighted Vector Sum Using Ldw

    Modulo Scheduling of Multicycle Loops 6.6.3.2 Translating Unrolled Inner Loop to Linear Assembly Example 6–37 shows the linear assembly that calculates c[i] and c[i+1] for the weighted vector sum in Example 6–36. The two store pointers (*ciptr and *ci+1ptr) are separated so that one (*ciptr) increments by 2 through the odd elements of the array and the other (*ci+1ptr) increments through the even elements.
  • Page 198: Dependency Graph Of Weighted Vector Sum

    Modulo Scheduling of Multicycle Loops 6.6.4 Drawing a Dependency Graph To achieve a minimum iteration interval of 2, you must put an equal number of operations per unit on each side of the dependency graph. Three operations in one unit on a side would result in a minimum iteration interval of 3. Figure 6–11 shows the dependency graph divided evenly with a minimum iteration interval of 2.
  • Page 199: Linear Assembly Resource Allocation

    Modulo Scheduling of Multicycle Loops 6.6.5 Linear Assembly Resource Allocation Using the dependency graph, you can allocate functional units and registers as shown in Example 6–38. This code is based on the following assumptions: The pointers are initialized outside the loop. m resides in B6, which causes both .M units to use a cross path.
  • Page 200 Modulo Scheduling of Multicycle Loops Only seven instructions have been scheduled in this table. The two LDWs use the .D units on the even cycles. The MPY and MPYH are scheduled on cycle 5 because the LDW has four delay slots. The MPY instructions appear in two rows because they use the .M and cross path resources on cycles 5, 7, 9, etc.
  • Page 201: Modulo Iteration Interval Table For Weighted Vector Sum (2-Cycle Loop)

    Modulo Scheduling of Multicycle Loops Table 6–12. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop) [table body not fully legible in this extraction; the legible rows show LDW ai_i+1 and LDW bi_i+1 scheduled on the .D units across successive iterations]
  • Page 202: Dependency Graph Of Weighted Vector Sum (Showing Resource Conflict)

    Modulo Scheduling of Multicycle Loops 6.6.6.1 Resource Conflicts Resources from one instruction cannot conflict with resources from any other instruction scheduled modulo iteration intervals away. In other words, for a 2-cycle loop, instructions scheduled on cycle n cannot use the same resources as instructions scheduled on cycles n + 2, n + 4, n + 6, etc.
  • Page 203: Modulo Iteration Interval Table For Weighted Vector Sum With Shr Instructions

    Modulo Scheduling of Multicycle Loops Table 6–13. Modulo Iteration Interval Table for Weighted Vector Sum With SHR Instructions [table body not fully legible in this extraction; the legible rows show LDW ai_i+1 and LDW bi_i+1 on the .D units against cycles 10, 12, 14, ...]
  • Page 204 Modulo Scheduling of Multicycle Loops 6.6.6.2 Live Too Long Scheduling SHR bi+1 on cycle 6 now creates a problem with scheduling the ADD ci instruction. The parents of ADD ci (AND bi and SHR pi_scaled) are scheduled on cycles 5 and 7, respectively. Because the SHR pi_scaled is scheduled on cycle 7, the earliest you can schedule ADD ci is cycle 8.
  • Page 205: Dependency Graph Of Weighted Vector Sum (With Resource Conflict Resolved)

    Modulo Scheduling of Multicycle Loops Figure 6–13. Dependency Graph of Weighted Vector Sum (With Resource Conflict Resolved) [graph not reproduced in this extraction; it shows the A-side and B-side nodes ai_i+1, bi_i+1, pi+1, pi_scaled, pi+1_scaled, bi+1, ci+1, cntr, and LOOP] Note: Shaded numbers indicate the cycle in which the instruction is first scheduled. 6-68
  • Page 206: Modulo Iteration Interval Table For Weighted Vector Sum (2-Cycle Loop)

    Modulo Scheduling of Multicycle Loops Table 6–14. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop) [table body not fully legible in this extraction; the legible rows show LDW ai_i+1 and LDW bi_i+1 scheduled on the .D units]
  • Page 207: Dependency Graph Of Weighted Vector Sum (Scheduling Ci +1)

    Modulo Scheduling of Multicycle Loops 6.6.6.4 Scheduling the Remaining Instructions Figure 6–14 shows the dependency graph with additional scheduling changes. The final version of the loop, with all instructions scheduled correctly, is shown in Table 6–15. Figure 6–14. Dependency Graph of Weighted Vector Sum (Scheduling ci+1) [graph not reproduced in this extraction]
  • Page 208 Modulo Scheduling of Multicycle Loops Table 6–15 shows the following additions: B LOOP (.S1, cycle 6), SUB cntr (.L1, cycle 5), ADD ci+1 (.L2, cycle 10), STH ci (cycle 9), and STH ci+1 (cycle 11). To avoid resource conflicts and live-too-long problems, Table 6–15 also includes the following additional changes: LDW bi_i+1 (.D2) moved from cycle 0 to cycle 2.
  • Page 209: Modulo Iteration Interval Table For Weighted Vector Sum (2-Cycle Loop)

    Modulo Scheduling of Multicycle Loops Table 6–15. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop) [table body not fully legible in this extraction; the legible rows show LDW ai_i+1 and LDW bi_i+1 on the .D units against cycles 10, 12, 14, ...]
  • Page 210: Using The Assembly Optimizer For The Weighted Vector Sum

    Modulo Scheduling of Multicycle Loops 6.6.7 Using the Assembly Optimizer for the Weighted Vector Sum Example 6–39 shows the linear assembly code to perform the weighted vector sum. You can use this code as input to the assembly optimizer to create a software-pipelined loop instead of scheduling this by hand.
  • Page 211: Final Assembly

    Modulo Scheduling of Multicycle Loops 6.6.8 Final Assembly Example 6–40 shows the final assembly code for the weighted vector sum. The following optimizations are included: While iteration n of instruction STH ci+1 is executing, iteration n + 1 of STH ci is executing. To prevent the STH ci instruction from executing iteration 51 while STH ci+1 executes iteration 50, execute the loop only 49 times and schedule the final executions of ADD ci+1 and STH ci+1 after exiting the loop.
  • Page 212: Assembly Code For Weighted Vector Sum

    Modulo Scheduling of Multicycle Loops Example 6–40. Assembly Code for Weighted Vector Sum *A4++,A2 ; ai & ai+1 .L2X A6,2,B0 ; set pointer to ci+1 *B4++,B2 ; bi & bi+1 *A4++,A2 ;* ai & ai+1 –1,B10 ; set to all 1s (0xFFFFFFFF) *B4++,B2 ;* bi &...
  • Page 213 Modulo Scheduling of Multicycle Loops Example 6–40. Assembly Code for Weighted Vector Sum (Continued) B9,*B0++[2] ; store ci+1 B5,15,B7 ;* (m * ai+1) >> 15 A9,*A6++[2] ;* store ci A5,15,A7 ;** (m * ai) >> 15 B2,B10,B8 ;** bi ||[A1] SUB A1,1,A1 ;*** decrement loop counter .M1X...
  • Page 214: Loop Carry Paths

    Loop Carry Paths 6.7 Loop Carry Paths Loop carry paths occur when one iteration of a loop writes a value that must be read by a future iteration. A loop carry path can affect the performance of a software-pipelined loop that executes multiple iterations in parallel. Sometimes loop carry paths (instead of resources) determine the minimum iteration interval.
  • Page 215: Translating C Code To Linear Assembly (Inner Loop)

    Loop Carry Paths 6.7.2 Translating C Code to Linear Assembly (Inner Loop) Example 6–42 shows the ’C6000 instructions that execute the inner loop of the IIR filter C code. In this example: xptr is not postincremented after loading xi+1, because xi of the next iteration is actually xi+1 of the current iteration.
  • Page 216: Drawing A Dependency Graph

    Loop Carry Paths 6.7.3 Drawing a Dependency Graph Figure 6–15 shows the dependency graph for the IIR filter. A loop carry path exists from the store of yi+1 to the load of yi. The path between the STH and the LDH is one cycle because the load and store instructions use the same memory pipeline.
  • Page 217: Determining The Minimum Iteration Interval

    Loop Carry Paths 6.7.4 Determining the Minimum Iteration Interval To determine the minimum iteration interval, you must consider both resources and data dependency constraints. Based on resources in Table 6–16, the minimum iteration interval is 2. Note: There are six non-.M units available: three on the A side (.S1, .D1, .L1) and three on the B side (.S2, .D2, .L2).
  • Page 218: Dependency Graph Of Iir Filter (With Smaller Loop Carry)

    Loop Carry Paths Although the minimum iteration interval is the greater of the resource limits and data dependency constraints, an interval of 10 seems slow. Figure 6–16 shows how to improve the performance. 6.7.4.1 Drawing a New Dependency Graph Figure 6–16 shows a new graph with a loop carry path of 4 (2 + 1 + 1). Because the MPY p2 instruction can read yi+1 while it is still in a register, you can reduce the loop carry path by six cycles.
  • Page 219: Linear Assembly Resource Allocation

    Loop Carry Paths 6.7.4.2 New ’C6x Instructions (Inner Loop) Example 6–43 shows the new linear assembly from the graph in Figure 6–16, where LDH yi was removed. The one variable y that is read and written is yi for the MPY p2 instruction and yi+1 for the SHR and STH instructions. Example 6–43.
  • Page 220: Modulo Iteration Interval Table For Iir (4-Cycle Loop)

    Loop Carry Paths 6.7.6 Modulo Iteration Interval Scheduling Table 6–17 shows the modulo iteration interval table for the IIR filter. The SHR instruction on cycle 10 finishes in time for the MPY p2 instruction from the next iteration to read its result on cycle 11. Table 6–17.
  • Page 221: Using The Assembly Optimizer For The Iir Filter

    Loop Carry Paths 6.7.7 Using the Assembly Optimizer for the IIR Filter Example 6–45 shows the linear assembly code to perform the IIR filter. Once again, you can use this code as input to the assembly optimizer to create a software-pipelined loop instead of scheduling this by hand.
  • Page 222: Final Assembly

    Loop Carry Paths 6.7.8 Final Assembly Example 6–46 shows the final assembly for the IIR filter. With one load of y[0] outside the loop, no other loads from the y array are needed. Example 6–46 requires 408 cycles: (4 × 100) + 8. Example 6–46.
  • Page 223: If-Then-Else Statements In A Loop

    If-Then-Else Statements in a Loop 6.8 If-Then-Else Statements in a Loop If-then-else statements in C cause certain instructions to execute when the if condition is true and other instructions to execute when it is false. One way to accomplish this in linear assembly code is with conditional instructions, because all ’C6000 instructions can be conditional on one of five general-purpose registers on the ’C62x and ’C67x and one of six on the ’C64x.
  • Page 224: Translating C Code To Linear Assembly

    If-Then-Else Statements in a Loop 6.8.2 Translating C Code to Linear Assembly Example 6–48 shows the linear assembly instructions needed to execute the inner loop of the C code in Example 6–47. Example 6–48. Linear Assembly for If-Then-Else Inner Loop codeword,mask,cond ;...
  • Page 225: Drawing A Dependency Graph

    If-Then-Else Statements in a Loop 6.8.3 Drawing a Dependency Graph Figure 6–17 shows the dependency graph for the if-then-else C code. This graph illustrates the following arrangement: Two nodes on the graph contain sum: one for the ADD and one for the SUB.
  • Page 226: Resource Table For If-Then-Else Code

    If-Then-Else Statements in a Loop 6.8.4 Determining the Minimum Iteration Interval With nine instructions, the minimum iteration interval is at least 2, because a maximum of eight instructions can be in parallel. Based on the way the dependency graph in Figure 6–17 is split, five instructions are on the A side and four are on the B side.
  • Page 227: Linear Assembly Resource Allocation

    If-Then-Else Statements in a Loop 6.8.5 Linear Assembly Resource Allocation Now that the graph is split and you know the minimum iteration interval, you can allocate functional units and registers to the instructions. You must ensure that no resource is used more than twice. Example 6–49 shows the linear assembly with the functional units and registers that are used in the inner loop.
  • Page 228: Final Assembly

    If-Then-Else Statements in a Loop 6.8.6 Final Assembly Example 6–50 shows the final assembly code after software pipelining. The performance of this loop is 70 cycles (2 × 32 + 6). Example 6–50. Assembly Code for If-Then-Else 32,B0 ; set up loop counter [B0] ADD –1,B0,B0 ;...
  • Page 229: Comparing Performance

    If-Then-Else Statements in a Loop 6.8.7 Comparing Performance You can improve the performance of the code in Example 6–50 if you know that the loop count is at least 3. If the loop count is at least 3, remove the decrement counter instructions outside the loop and put the MVK (for setting up the loop counter) in parallel with the first branch.
  • Page 230: Comparison Of If-Then-Else Code Examples

    If-Then-Else Statements in a Loop Table 6–19. Comparison of If-Then-Else Code Examples: Example 6–50, if-then-else assembly code, (2 × 32) + 6 = 70 cycles; Example 6–51, if-then-else assembly code with loop count greater than 3, (2 × 32) + 4 = 68 cycles. Optimizing Assembly Code via Linear Assembly 6-93...
  • Page 231: Loop Unrolling

    Loop Unrolling 6.9 Loop Unrolling Even though the performance of the previous example is good, it can be improved. When resources are not fully used, you can improve performance by unrolling the loop. In Example 6–52, only nine instructions execute every two cycles.
  • Page 232: Translating C Code To Linear Assembly

    Loop Unrolling 6.9.2 Translating C Code to Linear Assembly Example 6–53 shows the unrolled inner loop with 16 instructions and the possibility of achieving a loop with a minimum iteration interval of 3. Example 6–53. Linear Assembly for Unrolled If-Then-Else Inner Loop codeword,maski,condi ;...
  • Page 233: Dependency Graph Of If-Then-Else Code (Unrolled)

    Loop Unrolling 6.9.3 Drawing a Dependency Graph Although there are numerous ways to split the dependency graph, the main goal is to achieve a minimum iteration interval of 3 and meet these conditions: You cannot have more than nine non-.M instructions on either side. Only three non-.M instructions can execute per cycle.
  • Page 234: Resource Table For Unrolled If-Then-Else Code

    Loop Unrolling 6.9.4 Determining the Minimum Iteration Interval With 16 instructions, the minimum iteration interval is at least 3 because a maximum of six instructions can be in parallel with the following allocation possibilities: LDH must be on a .D unit. SHL, B, and MVK must be on a .S unit.
  • Page 235: Linear Assembly For Full Unrolled If-Then-Else Code

    Loop Unrolling Example 6–54. Linear Assembly for Full Unrolled If-Then-Else Code .global _unrolled_if_then _unrolled_if_then: .cproc a, cword, mask, theta .reg cword, mask, theta, ifi, ifi1, a, ai, ai1, cntr .reg cdi, cdi1, sumi, sumi1, sum A4,a ; C callable register for 1st operand B4,cword ;...
  • Page 236: Final Assembly

    Loop Unrolling 6.9.6 Final Assembly Example 6–55 shows the final assembly code after software pipelining. The cycle count of this loop is now 53: (3 × 16) + 5. Example 6–55. Assembly Code for Unrolled If-Then-Else 16,B0 ; set up loop counter *A4++,A5 ;...
  • Page 237: Comparison Of If-Then-Else Code Examples

    Loop Unrolling 6.9.7 Comparing Performance Table 6–21 compares the performance of all versions of the if-then-else code examples. Table 6–21. Comparison of If-Then-Else Code Examples: Example 6–50, if-then-else assembly code, (2 × 32) + 6 = 70 cycles; Example 6–51, if-then-else assembly code with loop count greater than 3, (2 × 32) + 4 = 68 cycles; Example 6–55, unrolled if-then-else assembly code, (3 × 16) + 5 = 53 cycles.
  • Page 238: Live-Too-Long Issues

    Live-Too-Long Issues 6.10 Live-Too-Long Issues When the result of a parent instruction is live longer than the minimum iteration interval of a loop, you have a live-too-long problem. Because each instruction executes every iteration interval cycle, the next iteration of that parent overwrites the register with a new value before the child can read it.
  • Page 239: Translating C Code To Linear Assembly

    Live-Too-Long Issues 6.10.2 Translating C Code to Linear Assembly Example 6–57 shows the assembly instructions that execute the inner loop in Example 6–56. Example 6–57. Linear Assembly for Live-Too-Long Inner Loop *aptr++,ai ; load ai from memory *bptr++,bi ; load bi from memory ai,c,a0 ;...
  • Page 240: Dependency Graph Of Live-Too-Long Code

    Live-Too-Long Issues Figure 6–19. Dependency Graph of Live-Too-Long Code [graph not reproduced in this extraction; it shows the A-side and B-side split-join paths feeding sum0, sum1, cntr, and LOOP] Optimizing Assembly Code via Linear Assembly 6-103
  • Page 241: Resource Table For Live-Too-Long Code

    Live-Too-Long Issues 6.10.4 Determining the Minimum Iteration Interval Table 6–22 shows the functional unit resources for the loop. Based on the resource usage, the minimum iteration interval is 2 for the following reasons: No specific resource is used more than twice, implying a minimum iteration interval of 2.
  • Page 242 Live-Too-Long Issues Because a0 is written at the end of cycle 6, it must be live from cycle 7 to cycle 10, or four cycles. No value can be live longer than the minimum iteration interval, because the next iteration of the loop will overwrite that value before the current iteration can read the value.
  • Page 243: Dependency Graph Of Live-Too-Long Code (Split-Join Path Resolved)

    Live-Too-Long Issues Figure 6–20. Dependency Graph of Live-Too-Long Code (Split-Join Path Resolved) [graph not reproduced in this extraction] 6.10.5 Linear Assembly Resource Allocation Example 6–58 shows the linear assembly code with the functional units assigned. The choice of units for the ADDs and SUB is flexible and represents one of a number of possibilities.
  • Page 244: Linear Assembly For Full Live-Too-Long Code

    Live-Too-Long Issues Example 6–58. Linear Assembly for Full Live-Too-Long Code .global _live_long _live_long: .cproc a, b, c, d, e .reg ai, bi, sum0, sum1, sum .reg a0p, a_0, a_1, a_2, a_3, b_0, b0p, b_1, b_2, b_3, cntr 100,cntr ; cntr = 100 ZERO sum0 ;...
  • Page 245: Final Assembly With Move Instructions

    Live-Too-Long Issues 6.10.6 Final Assembly With Move Instructions Example 6–59 shows the final assembly code after software pipelining. The performance of this loop is 212 cycles (2 × 100 + 11 + 1). Example 6–59. Assembly Code for Live-Too-Long With Move Instructions *A4++,A0 ;...
  • Page 246 Live-Too-Long Issues Example 6–59. Assembly Code for Live-Too-Long With Move Instructions (Continued) LOOP: A7,A2,A9 ;* a3 = a2 + a0 B7,B8,B9 ;* b3 = b2 + b0 .M1X A5,B6,A7 ;* a2 = a1 * d A3,A2 ;* save a0 across iterations .M2X B5,A8,B7 ;* b2 = b1 * e...
  • Page 247: Redundant Load Elimination

    Redundant Load Elimination 6.11 Redundant Load Elimination Filter algorithms typically read the same value from memory multiple times and are, therefore, prime candidates for optimization by eliminating redundant load instructions. Rather than perform a load operation each time a particular value is read, you can keep the value in a register and read the register multiple times.
  • Page 248: Fir Filter C Code With Redundant Load Elimination

    Redundant Load Elimination 6.11.1.2 New FIR Filter C Code Example 6–61 shows that after eliminating redundant loads, there are four memory-read operations for every four multiply-accumulate operations. Now the memory accesses no longer limit the performance. Example 6–61. FIR Filter C Code With Redundant Load Elimination void fir(short x[], short h[], short y[]) int i, j, sum0, sum1;...
  • Page 249: Translating C Code To Linear Assembly

    Redundant Load Elimination 6.11.2 Translating C Code to Linear Assembly Example 6–62 shows the linear assembly that performs the inner loop of the FIR filter C code. Element x0 is read by the MPY p00 before it is loaded by the LDH x0 instruction;...
  • Page 250: Dependency Graph Of Fir Filter

    Redundant Load Elimination 6.11.3 Drawing a Dependency Graph Figure 6–21 shows the dependency graph of the FIR filter with redundant load elimination. Figure 6–21. Dependency Graph of FIR Filter (With Redundant Load Elimination) [graph not reproduced in this extraction; it shows the A-side and B-side paths through the .M and .L units feeding sum0 and sum1]
  • Page 251: Resource Table For Fir Filter Code

    Redundant Load Elimination 6.11.4 Determining the Minimum Iteration Interval Table 6–23 shows that the minimum iteration interval is 2. An iteration interval of 2 means that two multiply-accumulates are executing per cycle. Table 6–23. Resource Table for FIR Filter Code (a) A side (b) B side Unit(s)
  • Page 252: Final Assembly

    Redundant Load Elimination Example 6–63. Linear Assembly for Full FIR Code (Continued) LOOP: .trip 16 *x_1++[2],x1 ; x1 = x[j+i+1] *h++[2],h0 ; h0 = h[i] x0,h0,p00 ; x0 * h0 .M1X x1,h0,p10 ; x1 * h0 p00,sum0,sum0 ; sum0 += x0 * h0 .L2X p10,sum1,sum1 ;...
  • Page 253: Final Assembly Code For Fir Filter With Redundant Load Elimination

    Redundant Load Elimination Example 6–64. Final Assembly Code for FIR Filter With Redundant Load Elimination 50,A2 ; set up outer loop counter 80,A3 ; used to rst x ptr outer loop 82,B6 ; used to rst h ptr outer loop OUTLOOP: *A4++[2],A0 ;...
  • Page 254 Redundant Load Elimination Example 6–64 Final Assembly Code for FIR Filter With Redundant Load Elimination (Continued) LOOP: .L2X A8,B9,B9 ; sum1 += x1 * h0 A7,A9,A9 ; sum0 += x0 * h0 B1,B0,B7 ;* x1 * h1 .M1X B1,A1,A8 ;* x1 * h0 ||[B2] B LOOP ;** branch to inner loop...
  • Page 255: Memory Banks

    6.12 Memory Banks The internal memory of the ’C6000 family varies from device to device. See the TMS320C6000 Peripherals Reference Guide to determine the memory blocks in your particular device. This section discusses how to write code to avoid memory bank conflicts.
  • Page 256: Bank Interleaved Memory With Two Memory Blocks

    Memory Banks Figure 6–23. 4-Bank Interleaved Memory With Two Memory Blocks [figure not reproduced in this extraction; it shows banks 0 through 3 of memory blocks 0 and 1, with consecutive halfword addresses interleaved across the four banks]
  • Page 257: Fir Filter Inner Loop

    Memory Banks 6.12.1 FIR Filter Inner Loop Example 6–65 shows the inner loop from the final assembly in Example 6–64. The LDHs from the h array are in parallel with LDHs from the x array. If x[1] is on an even halfword (bank 0) and h[0] is on an odd halfword (bank 1), Example 6–65 has no memory conflicts.
  • Page 258: Each Array On Same Loop Cycle)

    Memory Banks In the case of the FIR filter, scheduling the even and odd elements of the same array on the same loop cycle cannot be done in a 2-cycle loop, as shown in Figure 6–24. In this example, a valid 2-cycle software-pipelined loop without memory constraints is ruled by the following constraints: LDH h0 and LDH h1 are on the same loop cycle.
  • Page 259: Unrolled Fir Filter C Code

    Memory Banks 6.12.2 Unrolled FIR Filter C Code The main limitation in solving the problem in Figure 6–24 is in scheduling a 2-cycle loop, which means that no value can be live more than two cycles. Increasing the iteration interval to 3 decreases performance. A better solution is to unroll the inner loop one more time and produce a 4-cycle loop.
  • Page 260: Translating C Code To Linear Assembly

    Memory Banks 6.12.3 Translating C Code to Linear Assembly Example 6–67 shows the linear assembly for the unrolled inner loop of the FIR filter C code. Example 6–67. Linear Assembly for Unrolled FIR Inner Loop *x++,x1 ; x1 = x[j+i+1] *h++,h0 ;...
  • Page 261: Drawing A Dependency Graph

    Memory Banks 6.12.4 Drawing a Dependency Graph Figure 6–25 shows the dependency graph of the FIR filter with no memory hits. Figure 6–25. Dependency Graph of FIR Filter (With No Memory Hits) [graph not reproduced in this extraction; it shows the A-side and B-side paths feeding sum0 and sum1]
  • Page 262: Linear Assembly For Unrolled Fir Inner Loop With .Mptr Directive

    Memory Banks 6.12.5 Linear Assembly for Unrolled FIR Inner Loop With .mptr Directive Example 6–68 shows the unrolled FIR inner loop with the .mptr directive. The .mptr directive allows the assembly optimizer to automatically determine if two memory operations have a bank conflict by associating memory access information with a specific pointer register.
  • Page 263 Memory Banks Example 6–68. Linear Assembly for Full Unrolled FIR Filter (Continued) LOOP: .trip 8 *x_1++[2],x1 ; x1 = x[j+i+1] *h++[2],h0 ; h0 = h[i] .M1X x0,h0,p00 ; x0 * h0 x1,h0,p10 ; x1 * h0 p00,sum0,sum0 ; sum0 += x0 * h0 .L2X p10,sum1,sum1 ;...
  • Page 264: Linear Assembly Resource Allocation

    The assembly optimizer handles this automatically after it software pipelines the loop. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information. Optimizing Assembly Code via Linear Assembly
  • Page 265: Resource Table For Fir Filter Code

    Memory Banks 6.12.7 Determining the Minimum Iteration Interval Based on Table 6–24, the minimum iteration interval for the FIR filter with no memory hits should be 4. An iteration interval of 4 means that two multiply/accumulates still execute per cycle. Table 6–24.
  • Page 266: Final Assembly Code For Fir Filter With Redundant Load Elimination And No Memory Hits

    Memory Banks Example 6–69. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits 50,A2 ; set up outer loop counter 62,A3 ; used to rst x pointer outloop 64,B10 ; used to rst h pointer outloop OUTLOOP: *A4++,B5 ;...
  • Page 267 Memory Banks Example 6–69. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits (Continued) LOOP: .L2X A1,B9,B9 ; sum1 += x1 * h0 .L1X B6,A9,A9 ; sum0 += x1 * h1 B5,B8,B7 ; x0 * h3 A5,A7,A7 ;...
  • Page 268: Software Pipelining The Outer Loop

    Software Pipelining the Outer Loop 6.13 Software Pipelining the Outer Loop In previous examples, software pipelining has always affected the inner loop. However, software pipelining works equally well with the outer loop in a nested loop. 6.13.1 Unrolled FIR Filter C Code Example 6–70 shows the FIR filter C code after unrolling the inner loop (identical to Example 6–66 on page 6-122).
  • Page 269: Making The Outer Loop Parallel With The Inner Loop Epilog And Prolog

    Software Pipelining the Outer Loop 6.13.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog The final assembly code for the FIR filter with redundant load elimination and no memory hits (shown in Example 6–69 on page 6-129) contained 16 cycles of overhead to call the inner loop every time: ten cycles for the loop prolog and six cycles for the outer loop instructions and branching to the outer loop.
  • Page 270: Final Assembly Code For Fir Filter With Redundant Load Elimination And No Memory Hits With Outer Loop Software-Pipelined

    Software Pipelining the Outer Loop Example 6–71. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits With Outer Loop Software-Pipelined 50,A2 ; set up outer loop counter B11,*B15–– ; push register 74,A3 ; used to rst x ptr outer loop 72,B10 ;...
  • Page 271 Software Pipelining the Outer Loop Example 6–71. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits With Outer Loop Software-Pipelined (Continued) A0,A9,A9 ; sum0 += x0 * h0 .M2X A5,B8,B8 ; x3 * h3 .M1X B0,A7,A5 ;...
  • Page 272: Comparison Of Fir Filter Code

    Software Pipelining the Outer Loop Example 6–71. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits With Outer Loop Software-Pipelined (Continued) B7,B9,B9 ;e sum1 += x2 * h1 A5,A9,A9 ;e sum0 += x2 * h2 *A4++,B8 ;p x0 = x[j] .L2X...
  • Page 273: Outer Loop Conditionally Executed With Inner Loop

Outer Loop Conditionally Executed With Inner Loop 6.14 Outer Loop Conditionally Executed With Inner Loop Software pipelining the outer loop improved the outer loop overhead in the previous example from 16 cycles to 8 cycles. Executing the outer loop conditionally and in parallel with the inner loop eliminates the overhead entirely.
  • Page 274: Translating C Code To Linear Assembly (Inner Loop)

    Outer Loop Conditionally Executed With Inner Loop 6.14.2 Translating C Code to Linear Assembly (Inner Loop) Example 6–73 shows a list of linear assembly for the inner loop of the FIR filter C code (identical to Example 6–67 on page 6-123). Example 6–73.
  • Page 275: Translating C Code To Linear Assembly (Outer Loop)

Outer Loop Conditionally Executed With Inner Loop 6.14.3 Translating C Code to Linear Assembly (Outer Loop) Example 6–74 shows the instructions that execute all of the outer loop functions. All of these instructions are conditional on inner loop counters. Two different counters are needed, because they must decrement to 0 on different iterations.
  • Page 276: Unrolled Fir Filter C Code

    Outer Loop Conditionally Executed With Inner Loop Example 6–75. Unrolled FIR Filter C Code void fir(short x[], short h[], short y[]) int i, j, sum0, sum1; short x0,x1,x2,x3,x4,x5,x6,x7,h0,h1,h2,h3,h4,h5,h6,h7; for (j = 0; j < 100; j+=2) { sum0 = 0; sum1 = 0;...
  • Page 277: Translating C Code To Linear Assembly (Inner Loop)

    Outer Loop Conditionally Executed With Inner Loop 6.14.5 Translating C Code to Linear Assembly (Inner Loop) Example 6–76 shows the instructions that perform the inner and outer loops of the FIR filter. These instructions reflect the following modifications: LDWs are used instead of LDHs to reduce the number of loads in the loop. The reset pointer instructions immediately follow the LDW instructions.
  • Page 278: Linear Assembly For Fir With Outer Loop Conditionally Executed With Inner Loop

Outer Loop Conditionally Executed With Inner Loop Example 6–76. Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop *h++[2],h01 ; h[i+0] & h[i+1] *h_1++[2],h23 ; h[i+2] & h[i+3] *h++[2],h45 ; h[i+4] & h[i+5] *h_1++[2],h67 ; h[i+6] & h[i+7] *x++[2],x01 ;...
  • Page 279: Translating C Code To Linear Assembly (Inner Loop And Outer Loop)

    Outer Loop Conditionally Executed With Inner Loop Example 6–76. Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop (Continued) h23,x23,p02 ; p02 = h[i+2]*x[j+i+2] p02,sum01,sum02 ; sum0 += p02 MPYH h23,x23,p03 ; p03 = h[i+3]*x[j+i+3] p03,sum02,sum03 ; sum0 += p03 h45,x45,p04 ;...
  • Page 280: Linear Assembly For Fir With Outer Loop Conditionally Executed With Inner Loop (With Functional Units)

    Outer Loop Conditionally Executed With Inner Loop Example 6–77. Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop (With Functional Units) .global _fir _fir: .cproc x, h, y .reg x_1, h_1, y_1, octr, pctr, sctr .reg sum01, sum02, sum03, sum04, sum05, sum06, sum07 .reg sum11, sum12, sum13, sum14, sum15, sum16, sum17 .reg...
  • Page 281 Outer Loop Conditionally Executed With Inner Loop Example 6–77. Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop (With Functional Units) (Continued) .L2X x01,x01b ; move to other reg file MPYLH .M2X h01,x01b,p10 ; p10 = h[i+0]*x[j+i+1] [sctr] p10,sum17,p10 ;...
  • Page 282 Outer Loop Conditionally Executed With Inner Loop Example 6–77. Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop (With Functional Units)(Continued) h67,x67,p06 ; p06 = h[i+6]*x[j+i+6] .L1X p06,sum05,sum06 ; sum0 += p06 MPYH h67,x67,p07 ; p07 = h[i+7]*x[j+i+7] .L1X p07,sum06,sum07 ;...
  • Page 283: Resource Table For Fir Filter Code

    Outer Loop Conditionally Executed With Inner Loop 6.14.7 Determining the Minimum Iteration Interval Based on Table 6–27, the minimum iteration interval is 8. An iteration interval of 8 means that two multiply-accumulates per cycle are still executing. Table 6–27. Resource Table for FIR Filter Code (a) A side (b) B side Unit(s)
  • Page 284: Final Assembly Code For Fir Filter

    Outer Loop Conditionally Executed With Inner Loop Example 6–78. Final Assembly Code for FIR Filter .L1X B4,A0 ; point to h[0] & h[1] B4,4,B2 ; point to h[2] & h[3] .L2X A4,B1 ; point to x[j] & x[j+1] A4,4,A4 ; point to x[j+2] & x[j+3] 200,B0 ;...
  • Page 285 Outer Loop Conditionally Executed With Inner Loop Example 6–78. Final Assembly Code for FIR Filter (Continued) LOOP: [!A2] A10,15,A12 ; (Asum0 >> 15) ||[B0] B0,1,B0 ; dec outer lp cntr MPYH B7,B9,B13 ; p03 = h[i+3]*x[j+i+3] ||[A2] A7,A10,A7 ; sum0(p00) = p00 + sum0 MPYHL .M1X B7,A11,A10...
  • Page 286: Comparison Of Fir Filter Code

    Outer Loop Conditionally Executed With Inner Loop Example 6–78. Final Assembly Code for FIR Filter (Continued) .L2X A9,B8,B11 ; sum1 += p17 .L1X B11,A12,A12 ; sum0 += p06 A8,A10,A7 ;* p00 = h[i+0]*x[j+i+0] MPYLH B7,B9,B13 ;* p12 = h[i+2]*x[j+i+3] ||[A2] A2,1,A2 ;* dec store lp cntr .L1X...
  • Page 287: Interrupts

Chapter 7 Interrupts This chapter describes interrupts from a software-programming point of view. A description of single and multiple register assignment is included, followed by a discussion of interruptible code generation and, finally, descriptions of interrupt subroutines. Topic Page Overview of Interrupts .
  • Page 288: Overview Of Interrupts

This chapter focuses on the software issues associated with interrupts. The hardware description of interrupt operation is fully described in the TMS320C6000 CPU and Instruction Set Reference Guide. In order to understand the software issues of interrupts, we must consider two types of code: the code that is interrupted and the interrupt subroutine, which performs the tasks required by the interrupt.
  • Page 289: Single Assignment Vs. Multiple Assignment

Single Assignment vs. Multiple Assignment 7.2 Single Assignment vs. Multiple Assignment Register allocation on the ’C6000 can be classified as either single assignment or multiple assignment. Single assignment code is interruptible; multiple assignment is not interruptible. This section discusses the differences between the two and explains why only single assignment is interruptible.
  • Page 290: Code Using Single Assignment

    Single Assignment vs. Multiple Assignment Example 7–2. Code Using Single Assignment cycle A4,A5,A1 ; writes to A1 in single cycle *A0,A6 ; writes to A1 after 4 delay slots A1,A2,A3 ; uses old A1 (result of SUB) 5–6 NOP A6,A4,A5 ;...
  • Page 291: Interruptible Loops

    Interruptible Loops 7.3 Interruptible Loops Even if code employs single assignment, it may not be interruptible in a loop. Because the delay slots of all branch operations are protected from interrupts in hardware, all interrupts remain pending as long as the CPU has a pending branch.
  • Page 292: Interruptible Code Generation

The tools provide three levels of control to the user. These levels are described in the following sections. For a full description of interruptible code generation, see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide. 7.4.1 Level 0 - Specified Code is Guaranteed to Not Be Interrupted At this level, the compiler does not disable interrupts.
  • Page 293: Level 1 – Specified Code Interruptible At All Times

    Interruptible Code Generation 7.4.2 Level 1 – Specified Code Interruptible at All Times At this level, the compiler employs single assignment everywhere and never produces a loop of less than 6 cycles. The command line option –mi1 can be used for an entire module and the following pragma can be used to force this level on a particular function: #pragma FUNC_INTERRUPT_THRESHOLD(func, 1);...
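The pragma shown above attaches to a specific function definition. The sketch below shows a hypothetical placement on an invented function; on compilers other than TI's the unknown pragma is simply ignored, so the snippet still builds and runs on a host for illustration.

```c
#include <assert.h>

/* Hypothetical function forced to interruptibility level 1. On the TI
   compiler, the pragma tells the code generator to use single
   assignment and never produce a loop of less than 6 cycles; other
   compilers ignore the unknown pragma. */
#pragma FUNC_INTERRUPT_THRESHOLD(dot_prod, 1)
int dot_prod(const short *a, const short *b, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Host-side self-check helper (not part of the pattern itself). */
static int demo(void)
{
    short a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40};
    return dot_prod(a, b, 4);
}
```

The equivalent module-wide control is the –mi1 command-line option; the pragma is the per-function form.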
  • Page 294: Getting The Most Performance Out Of Interruptible Code

Interruptible Code Generation 7.4.4 Getting the Most Performance Out of Interruptible Code As stated in Chapter 4 and Chapter 7, the .trip directive and the MUST_ITERATE pragma can be used to specify a maximum value for the trip count of a loop.
  • Page 295: Dot Product With _Nassert Guaranteeing Trip Count Range

    Interruptible Code Generation Example 7–4. Dot Product With _nassert Guaranteeing Trip Count Range int dot_prod(short *a, short *b, int n) int i, sum = 0; #pragma MUST_ITERATE (20,50); for (i = 0; i < n; i++) sum += a[i] * b[i]; return sum;...
  • Page 296: Dot Product With Must_Iterate Pragma Guaranteeing Trip Count Range And Factor Of 2

    Interruptible Code Generation Example 7–5. Dot Product With MUST_ITERATE Pragma Guaranteeing Trip Count Range and Factor of 2 int dot_prod(short *a, short *b, int n) int i, sum = 0; #pragma MUST_ITERATE (20,50,2); for (i = 0; i < n; i++) sum += a[i] * b[i];...
  • Page 297: Interrupt Subroutines

The ’C6000 provides hardware to automatically branch to this routine when an interrupt is received based on an interrupt service table. (See the Interrupt Service Table in the TMS320C6000 CPU and Instruction Set Reference Guide.) Once the branch is complete, execution begins at the first execute packet of the ISR.
  • Page 298: Isr With Hand-Coded Assembly

    Interrupt Subroutines 7.5.2 ISR with Hand-Coded Assembly When writing an ISR by hand, it is necessary to handle the same tasks the C/C++ compiler does. So, the following steps must be taken: All registers used must be saved to the stack before modification. For this reason, it is preferable to maintain one general purpose register to be used as a stack pointer in your application.
  • Page 299: Nested Interrupts

Interrupt Subroutines 7.5.3 Nested Interrupts Sometimes it is desirable to allow higher priority interrupts to interrupt lower priority ISRs. To allow nested interrupts to occur, you must first save the IRP, IER, and CSR to a register that is not being used or to some other memory location (usually the stack).
  • Page 300: Hand-Coded Assembly Isr Allowing Nesting Of Interrupts

    Interrupt Subroutines Example 7–8. Hand-Coded Assembly ISR Allowing Nesting of Interrupts * Assume Register B0–B5 & A0 are the only registers used by the * ISR and no other functions are called B0,*B15–– ; store B0 to stack || MVC IRP, B0 ;...
  • Page 301: C64X Programming Considerations

Chapter 8 ’C64x Programming Considerations This chapter covers material specific to the TMS320C64x series of DSPs. It builds on the material presented elsewhere in this book, with additional information specific to the VelociTI.2 extensions that the ’C64x provides. Before reading this chapter, familiarize yourself with the programming concepts presented earlier for the entire C6000 family, as these concepts also apply to the ’C64x.
  • Page 302: Overview Of 'C64X Architectural Enhancements

Overview of ’C64x Architectural Enhancements 8.1 Overview of ’C64x Architectural Enhancements The ’C64x is a fixed-point digital signal processor (DSP) and is the first DSP to add VelociTI.2 extensions to the existing high-performance VelociTI architecture. VelociTI.2 extensions provide the following features: Greater scheduling flexibility for existing instructions Greater memory bandwidth with double-word load and store instructions Support for packed 8-bit and 16-bit data types
  • Page 303: Non-Aligned Memory Accesses

Instructions in this category include BITC4, BITR, ROTL, SHFL, and DEAL. See the TMS320C6000 CPU and Instruction Set Reference Guide for more details on these and related instructions.
  • Page 304: Packed-Data Processing On The 'C64X

Packed-Data Processing on the ’C64x 8.2 Packed-Data Processing on the ’C64x 8.2.1 Introduction to Packed Data Processing Techniques Packed-data processing is a type of processing where a single instruction applies the same operation to multiple independent pieces of data. For example, the ADD2 instruction performs two independent 16-bit additions between two pairs of 16-bit values.
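As a concrete illustration of that packed 16-bit addition, the following host-side C sketch models the behavior of ADD2 (the emu_ name and the emulation itself are not TI code, only a model of the semantics): a carry out of the low halfword never ripples into the high halfword.

```c
#include <stdint.h>

/* Model of the ADD2 semantics: two independent 16-bit adds packed
   into one 32-bit word. A carry out of the low halfword is dropped
   rather than propagating into the high halfword. */
static uint32_t emu_add2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;                  /* low 16-bit add  */
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;  /* high 16-bit add */
    return (hi << 16) | lo;
}
```

For example, adding 0xFFFF to the low halfword wraps to 0 without disturbing the high halfword, which is exactly what distinguishes ADD2 from an ordinary 32-bit ADD.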
  • Page 305: Four Bytes Packed Into A Single General Purpose Register

Packed-Data Processing on the ’C64x

Table 8–1. Packed Data Types

Element Size    Signed/Unsigned    Element Type      Elements in 32-bit Word    Level of Support
8 bits          unsigned           unsigned char     4                          high
16 bits         signed             short             2                          high
8 bits          signed             char              4                          limited
16 bits         unsigned           unsigned short    2                          limited
  • Page 306: Two Half-Words Packed Into A Single General Purpose Register

    Packed-Data Processing on the ’C64x Figure 8–2. Two Half–Words Packed Into a Single General Purpose Register. 16 bits 16 bits Halfword 1 Halfword 0 General purpose Halfword 1 Halfword 0 register 32 bits Notice that there is no distinction between signed or unsigned data made in Figure 8–1 and Figure 8–2.
  • Page 307: Supported Operations On Packed Data Types

Packed-Data Processing on the ’C64x Table 8–2. Supported Operations on Packed Data Types Operation Support for 8-bit Support for 16-bit Notes Signed Unsigned Signed Unsigned ADD/SUB Saturated ADD Booleans Uses generic boolean instructions Shifts Right-shift only Multiply Dot Product Max/Min/Compare CMPEQ works
  • Page 308: Instructions For Manipulating Packed Data Types

    Packed-Data Processing on the ’C64x Table 8–3. Instructions for Manipulating Packed Data Types Mnemonic Intrinsic Typical Uses With Packed Data PACK2 _pack2 Packing 16-bit portions of 32-bit quantities. PACKH2 _packh2 Rearranging packed 16-bit quantities. PACKHL2 _packhl2 Rearranging pairs of 16-bit quantities. PACKLH2 _packlh2 SPACK2...
  • Page 309: Graphical Representation Of _Packxx2 Intrinsics

    Packed-Data Processing on the ’C64x Figure 8–3. Graphical Representation of _packXX2 Intrinsics b_hi b_lo a_hi a_lo c = _pack2(b, a) b_lo a_lo b_hi b_lo a_hi a_lo c=_packh2(b, a) b_hi a_hi b_hi b_lo a_hi a_lo c=_packhl2(b, a) b_hi a_lo b_hi b_lo a_hi a_lo c=_packlh2(b, a)
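The four variants in Figure 8–3 can be modeled on a host with plain shifts and masks. This is only a sketch of the semantics shown above (the emu_ names are invented), but it makes explicit which halfword of each input every intrinsic selects.

```c
#include <stdint.h>

/* Halfword-selection semantics of the _packXX2 intrinsics: each
   result combines one halfword of b (upper) with one of a (lower). */
static uint32_t emu_pack2  (uint32_t b, uint32_t a) { return (b << 16) | (a & 0xFFFFu); }          /* b_lo : a_lo */
static uint32_t emu_packh2 (uint32_t b, uint32_t a) { return (b & 0xFFFF0000u) | (a >> 16); }      /* b_hi : a_hi */
static uint32_t emu_packhl2(uint32_t b, uint32_t a) { return (b & 0xFFFF0000u) | (a & 0xFFFFu); }  /* b_hi : a_lo */
static uint32_t emu_packlh2(uint32_t b, uint32_t a) { return (b << 16) | (a >> 16); }              /* b_lo : a_hi */
```

With b = 0x11112222 and a = 0x33334444, the four variants select 0x22224444, 0x11113333, 0x11114444, and 0x22223333 respectively.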
  • Page 310: Unpacking Packed 16-Bit Quantities To 32-Bit Values

    Packed-Data Processing on the ’C64x Figure 8–4. Graphical Representation of _spack2 32 bits 32 bits Signed 32–bit Signed 32–bit Saturation step Signed Signed 16–bit 16–bit Packing step c = _spack2(b, a) 16 bits 16 bits Notice that there are no special unpack operations for 16-bit data. Instead, the normal 32-bit right-shifts and extract operations can be used to unpack 16-bit elements into 32-bit quantities.
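The saturate-then-pack step in Figure 8–4 can be sketched in portable C as follows (a model of the semantics, not TI library code): each signed 32-bit input is clamped to the signed 16-bit range before the two results are packed.

```c
#include <stdint.h>

/* Clamp a signed 32-bit value to the signed 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Model of _spack2: saturate b and a to 16 bits, then pack sat(b)
   into the upper halfword and sat(a) into the lower halfword. */
static uint32_t emu_spack2(int32_t b, int32_t a)
{
    return ((uint32_t)(uint16_t)sat16(b) << 16) | (uint16_t)sat16(a);
}
```

Values that already fit in 16 bits pass through unchanged; out-of-range values pin at 0x7FFF or 0x8000.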
  • Page 311 Packed-Data Processing on the ’C64x Figure 8–5. Graphical Representation of 8–bit Packs (_packX4 and _spacku4) c = _packh4(b, a) c = _packl4(b, a) signed 16–bit signed 16–bit signed 16–bit signed 16–bit Saturation Unsigned 8-bit sat(b_hi) sat(b_lo) sat(a_hi) sat(a_lo) c = _spacku4(b, a) sat(b_hi) sat(b_lo) sat(a_hi)
  • Page 312 Packed-Data Processing on the ’C64x Figure 8–6. Graphical Representation of 8–bit Unpacks (_unpkXu4) 00000000 00000000 b = unpkhu4(a); 00000000 00000000 b = unpklu4(a); The ’C64x also provides a handful of additional byte-manipulating operations that have proven useful in various algorithms. These operations are neither packs nor unpacks, but rather shuffle bytes within a word.
  • Page 313: Optimizing For Packed Data Processing

Packed-Data Processing on the ’C64x Figure 8–7. Graphical Representation of (_shlmb, _shrmb, and _swap4) c = _shlmb(b, a) c = _shrmb(b, a) b = _swap4(a) 8.2.5 Optimizing for Packed Data Processing The ’C64x supports two basic forms of packed-data optimization, namely vectorization and macro operations.
  • Page 314: Graphical Representation Of

    Packed-Data Processing on the ’C64x Example 8–1. Vector Sum void vec_sum(const short *restrict a, const short *restrict b, short *restrict c, int len) int i; for (i = 0; i < len; i++) c[i] = b[i] + a[i]; Example 8–2. Vector Multiply void vec_mpy(const short *restrict a, const short *restrict b, short *restrict c, int len, int shift) int i;...
  • Page 315: Dot Product

Packed-Data Processing on the ’C64x Although pure vector algorithms exist, most applications do not consist purely of vector operations as simple as the one shown above. More commonly, an algorithm has portions that behave as a vector algorithm and portions that do not.
  • Page 316: Graphical Representation Of Dot Product

    Packed-Data Processing on the ’C64x Figure 8–9. Graphical Representation of Dot Product . . . Input A Item 0 Item 1 Item 2 Item n . . . Input B Item 0 Item 1 Item 2 Item n multiply multiply multiply multiply As you can see, this does not fit the pure vector model presented in...
  • Page 317: Graphical Representation Of A Single Iteration Of Vector Complex Multiply

    Packed-Data Processing on the ’C64x Figure 8–10. Graphical Representation of a Single Iteration of Vector Complex Multiply. Array element 2n+1 Array element 2n Input A (real component) (imaginary component) Array element 2n+1 Array element 2n Input B (real component) (imaginary component) multiply multiply multiply...
  • Page 318 Packed-Data Processing on the ’C64x stores typically occur near the very beginning and end of the loop body. The following examples use this outside-in approach to perform packed data opti- mization techniques on the example kernels. Note: The following examples assume that the compiler has not performed any packed data optimizations.
  • Page 319: Array Access In Vector Sum By Lddw

Packed-Data Processing on the ’C64x Example 8–5. Vectorization: Using LDDW and STDW in Vector Sum void vec_sum(const short *restrict a, const short *restrict b, short *restrict c, int len) int i; unsigned a_hi, a_lo; unsigned b_hi, b_lo; unsigned c_hi, c_lo; for (i = 0;...
  • Page 320: Array Access In Vector Sum By Stdw

    Packed-Data Processing on the ’C64x Figure 8–12. Array Access in Vector Sum by STDW c_hi c_lo c[3] c[2] c[1] c[0] _itod() intrinsic 32 bits 32 bits c[3] c[2] c[1] c[0] 64 bits . . . c[7] c[6] c[5] c[4] c[3] c[2] c[1] c[0]...
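The _hi()/_lo() extraction and _itod() recombination used in this access pattern can be modeled with 64-bit integer arithmetic. The real intrinsics view a 64-bit register pair through a double; here uint64_t stands in for that pair, and the emu_ names are invented for the sketch.

```c
#include <stdint.h>

/* uint64_t stands in for the 64-bit register pair that the real
   _hi()/_lo()/_itod() intrinsics view through a double. */
static uint32_t emu_hi(uint64_t d) { return (uint32_t)(d >> 32); }
static uint32_t emu_lo(uint64_t d) { return (uint32_t)d; }

/* Model of _itod: rebuild the 64-bit value from its two halves
   so it can be stored with a single double-word store. */
static uint64_t emu_itod(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Splitting with emu_hi/emu_lo and rejoining with emu_itod is an exact round trip, which is why the LDDW/STDW pattern loses no data.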
  • Page 321: Vector Addition (Complete)

    Packed-Data Processing on the ’C64x Example 8–6. Vector Addition (Complete) void vec_sum(const short *restrict a, const short *restrict b, short *restrict c, int len) int i; unsigned a3_a2, a1_a0; unsigned b3_b2, b1_b0; unsigned c3_c2, c1_c0; for (i = 0; i < len; i += 4) a3_a2 = _hi(*(const double *) &a[i]);...
  • Page 322: Graphical Representation Of A Single Iteration Of Vector Multiply

Packed-Data Processing on the ’C64x Figure 8–14. Graphical Representation of a Single Iteration of Vector Multiply. 16 bits a[i] mult b[i] 32 bits a[i] * b[i] Right shift 16 bits c[i] Notice that the values are still loaded and stored as 16-bit quantities. Therefore, you should use the same basic flow as the vector sum.
  • Page 323: Using Lddw And Stdw In Vector Multiply

    Packed-Data Processing on the ’C64x Example 8–7. Using LDDW and STDW in Vector Multiply void vec_mpy(const short *restrict a, const short *restrict b, short *restrict c, int len, int shift) int i; unsigned a_hi, a_lo; unsigned b_hi, b_lo; unsigned c_hi, c_lo; for (i = 0;...
  • Page 324: Using _Mpy2() And _Pack2() To Perform The Vector Multiply

    Packed-Data Processing on the ’C64x The ’C64x provides the _pack family intrinsics to convert the 32-bit results into 16-bit results. The _packXX2() intrinsics, described in section 8.2.4, extract two 16-bit values from two 32-bit registers, returning the results in a single 32-bit register.
  • Page 325 Packed-Data Processing on the ’C64x This code works, but it is heavily bottlenecked on shifts. One way to eliminate this bottleneck is to use the packed 16-bit shift intrinsic, _shr2(). This can be done without losing precision, under the following conditions: If the shift amount is known to be greater than or equal to 16, use _packh2() instead of _pack2() before the shift.
  • Page 326: Fine Tuning Vector Multiply (Shift > 16)

    Packed-Data Processing on the ’C64x Figure 8–16. Fine Tuning Vector Multiply (shift > 16) Original data flow Signed 32 bit product Signed 32 bit product Right shifts Discarded Discarded 16-bit result 16-bit result sign bits sign bits _pack2 c[1] c[0] 16 bits 16 bits Modified data flow...
  • Page 327: Fine Tuning Vector Multiply (Shift < 16)

    Packed-Data Processing on the ’C64x Figure 8–17. Fine Tuning Vector Multiply (shift < 16) Original data flow Signed 32 bit product Signed 32 bit product Right shifts Discarded Discarded 16-bit result 16-bit result sign bits sign bits _pack2 c[1] c[0] 16 bits 16 bits Modified data flow...
  • Page 328: Intrinsics Which Combine Multiple Operations In One Instruction

Packed-Data Processing on the ’C64x 8.2.7 Combining Multiple Operations in a Single Instruction The Dot Product and Vector Complex Multiply examples that were presented earlier were both examples of kernels that benefit from macro operations, that is, instructions that perform more than a simple operation. The ’C64x provides a number of instructions that combine common operations together.
  • Page 329: Vectorized Form Of The Dot Product Kernel

Packed-Data Processing on the ’C64x _dotpu4 eliminates three adds. The following sections describe how to write the Dot Product and Vector Complex Multiply examples to take advantage of these. 8.2.7.1 Combining Operations in the Dot Product Kernel The Dot Product kernel, presented in Example 8–3, benefits from both vectorization and macro operations.
  • Page 330: Graphical Representation Of The _Dotp2 Intrinsic C = _Dotp2(B, A)

    Packed-Data Processing on the ’C64x While this code is fully vectorized, it still can be improved. The kernel itself is performing two LDDWs, two MPY2, four ADDs, and one Branch. Because of the large number of ADDs, the loop cannot fit in a single cycle, and so the ’C64x datapath is not used efficiently.
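The _dotp2 operation that resolves this bottleneck can be modeled on a host with a few lines of C. The sketch below (invented emu_ name, not TI code) shows the two signed 16x16 multiplies and the add that _dotp2 fuses into a single instruction.

```c
#include <stdint.h>

/* Model of _dotp2: multiply the signed upper halfwords, multiply the
   signed lower halfwords, and add the two 32-bit products. */
static int32_t emu_dotp2(uint32_t b, uint32_t a)
{
    int32_t hi = (int32_t)(int16_t)(b >> 16) * (int16_t)(a >> 16);
    int32_t lo = (int32_t)(int16_t)(b & 0xFFFFu) * (int16_t)(a & 0xFFFFu);
    return hi + lo;
}
```

With (3,4) and (5,6) packed into single words, the result is 3*5 + 4*6 = 39: one instruction replacing two multiplies and an add in the inner loop.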
  • Page 331: Vectorized Form Of The Dot Product Kernel

Packed-Data Processing on the ’C64x Example 8–10. Vectorized Form of the Dot Product Kernel int dot_prod(const short *restrict a, const short *restrict b, short *restrict c, int len) int i, sum = 0; /* 32–bit accumulation unsigned a3_a2, a1_a0; /* Packed 16–bit values unsigned b3_b2, b1_b0;...
  • Page 332 Packed-Data Processing on the ’C64x 8.2.7.2 Combining Operations in the Vector Complex Multiply Kernel The Vector Complex Multiply kernel that was originally shown in Example 8–4 can be optimized with a technique similar to the one that used with the Dot Product kernel in Section 8.2.4.1.
  • Page 333: Vectorized Form Of The Vector Complex Multiply Kernel

    Packed-Data Processing on the ’C64x Example 8–12. Vectorized form of the Vector Complex Multiply Kernel void vec_cx_mpy(const short *restrict a, const short *restrict b, short *restrict c, int len, int shift) int i; unsigned a3_a2, a1_a0; /* Packed 16–bit values unsigned b3_b2, b1_b0;...
  • Page 334: The _Dotpn2 Intrinsic Performing Real Portion Of Complex Multiply

    Packed-Data Processing on the ’C64x Example 8–12 still performs the complex multiply as a series of discrete steps once the individual elements are loaded. The next optimization step is to com- bine some of the multiplies and adds/subtracts into _dotp and _dotpn intrinsics in a similar manner to the Dot Product example presented earlier.
  • Page 335: Packlh2 And _Dotp2 Working Together

Packed-Data Processing on the ’C64x The solution is to reorder the halfwords from one of the inputs, so that the imaginary component is in the upper halfword and the real component is in the lower halfword. This is accomplished by using the _packlh2 intrinsic to reorder the halves of the word.
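This combination can be checked with a small host-side model. The emu_ helpers below are invented stand-ins for the intrinsics, and the packing convention (real in the upper halfword, imaginary in the lower) is an assumption taken from the figures: _dotpn2 produces the real part directly, and _dotp2 applied after _packlh2 swaps one input's halves to produce the imaginary part.

```c
#include <stdint.h>

static int32_t emu_dotp2(uint32_t b, uint32_t a)     /* hi*hi + lo*lo */
{
    return (int32_t)(int16_t)(b >> 16) * (int16_t)(a >> 16)
         + (int32_t)(int16_t)(b & 0xFFFFu) * (int16_t)(a & 0xFFFFu);
}
static int32_t emu_dotpn2(uint32_t b, uint32_t a)    /* hi*hi - lo*lo */
{
    return (int32_t)(int16_t)(b >> 16) * (int16_t)(a >> 16)
         - (int32_t)(int16_t)(b & 0xFFFFu) * (int16_t)(a & 0xFFFFu);
}
static uint32_t emu_packlh2(uint32_t b, uint32_t a)  /* b_lo : a_hi */
{
    return (b << 16) | (a >> 16);
}

/* Complex multiply with each value packed as (real << 16) | imag:
   real = ar*br - ai*bi, imag = ar*bi + ai*br. */
static int32_t cx_real(uint32_t a, uint32_t b) { return emu_dotpn2(a, b); }
static int32_t cx_imag(uint32_t a, uint32_t b) { return emu_dotp2(a, emu_packlh2(b, b)); }
```

For example, (3 + 2j) * (1 + 4j) = -5 + 14j, which the two helpers reproduce from the packed inputs.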
  • Page 336: Vectorized Form Of The Vector Complex Multiply

    Packed-Data Processing on the ’C64x Example 8–13. Vectorized form of the Vector Complex Multiply void vec_cx_mpy(const short *restrict a, const short *restrict b, short *restrict c, int len, int shift) int i; unsigned a3_a2, a1_a0; /* Packed 16–bit values unsigned b3_b2, b1_b0; /* Packed 16–bit values c3,c2, c1,c0;...
  • Page 337: Comparison Between Aligned And Non-Aligned Memory Accesses

Packed-Data Processing on the ’C64x 8.2.8 Non-Aligned Memory Accesses In addition to traditional aligned memory access methods, the ’C64x also provides intrinsics for non-aligned memory accesses. Aligned memory accesses are restricted to an alignment boundary that is determined by the amount of data being accessed.
  • Page 338: Non–Aligned Memory Access With _Mem4 And _Memd8

    Packed-Data Processing on the ’C64x 8.2.8.1 Using Non-Aligned Memory Access Intrinsics Non-aligned memory accesses are generated using the _mem4() and _memd8() intrinsics. These intrinsics generate a non-aligned reference which may be read or written to, much like an array reference. Example 8–14 below illustrates reading and writing via these intrinsics.
  • Page 339 Packed-Data Processing on the ’C64x 8.2.8.2 When to Use Non-Aligned Memory Accesses As noted earlier, the ’C64x can provide 128 bits/cycle bandwidth with aligned memory accesses, and 64 bits/cycle bandwidth with non-aligned memory ac- cesses. Therefore, it is important to use non–aligned memory accesses in places where they provide a true benefit over aligned memory accesses.
  • Page 340: Graphical Illustration Of _Cmpxx2 Intrinsics

Packed-Data Processing on the ’C64x 8.2.9 Performing Conditional Operations with Packed Data The ’C64x provides a set of operations that are intended to provide conditional data flow in code that operates on packed data. These operations make it possible to avoid breaking the packed data flow with unpacking code and traditional ’if’ statements.
  • Page 341: Graphical Illustration Of _Cmpxx4 Intrinsics

    Packed-Data Processing on the ’C64x Figure 8–22. Graphical Illustration of _cmpXX4 Intrinsics The _cmpXX4 operation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 c = cmpXX4(a, b) The expand intrinsics work from a bitfield such as the bitfield returned by the compare intrinsics.
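The bitfield result described above can be modeled byte by byte. In the sketch below (an invented emulation, not TI code), bit i of the result corresponds to byte i of the inputs, counting from the least significant byte.

```c
#include <stdint.h>

/* Model of _cmpgtu4: compare the four unsigned bytes of a and b and
   return a 4-bit field with bit i set when byte i of a is greater. */
static unsigned emu_cmpgtu4(uint32_t a, uint32_t b)
{
    unsigned r = 0;
    int i;
    for (i = 0; i < 4; i++) {
        uint8_t ab = (uint8_t)(a >> (8 * i));
        uint8_t bb = (uint8_t)(b >> (8 * i));
        if (ab > bb)
            r |= 1u << i;
    }
    return r;
}
```

Comparing 0x05010400 against 0x03020300 sets bits 1 and 3 (the 0x04 > 0x03 and 0x05 > 0x03 bytes), giving the field 0b1010.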
  • Page 342: Graphical Illustration Of _Xpnd2 Intrinsic

    Packed-Data Processing on the ’C64x Figure 8–23. Graphical Illustration of _xpnd2 Intrinsic xpnd xpnd b_hi b_lo b = xpnd2(a) Figure 8–24. Graphical Illustration of _xpnd4 Intrinsic xpnd xpnd xpnd xpnd b = xpnd4(a) 8-42...
  • Page 343: Clear Below Threshold Kernel

    Packed-Data Processing on the ’C64x Example 8–16 illustrates an example that can benefit from the packed compare and expand intrinsics in action. The Clear Below Threshold kernel scans an image of 8-bit unsigned pixels, and sets all pixels that are below a certain threshold to 0.
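The compare-and-expand pattern used by the Clear Below Threshold kernel can be sketched on a host for a single word of four pixels. The emu_ helpers are invented models of _cmpgtu4 and _xpnd4: expand the 4-bit compare result into byte masks, then AND the masks with the pixels so that everything at or below the threshold becomes 0, with no unpacking and no branches.

```c
#include <stdint.h>

/* Model of _cmpgtu4: 4-bit field, bit i set when unsigned byte i of a
   is greater than byte i of b. */
static unsigned emu_cmpgtu4(uint32_t a, uint32_t b)
{
    unsigned r = 0;
    int i;
    for (i = 0; i < 4; i++)
        if ((uint8_t)(a >> (8 * i)) > (uint8_t)(b >> (8 * i)))
            r |= 1u << i;
    return r;
}

/* Model of _xpnd4: expand each of the 4 low bits of m into a full
   byte of the result (1 -> 0xFF, 0 -> 0x00). */
static uint32_t emu_xpnd4(unsigned m)
{
    uint32_t r = 0;
    int i;
    for (i = 0; i < 4; i++)
        if (m & (1u << i))
            r |= 0xFFu << (8 * i);
    return r;
}

/* Four pixels at a time: keep pixels strictly greater than the
   threshold, clear the rest. */
static uint32_t clear_below(uint32_t pix, uint8_t t)
{
    uint32_t t4 = (uint32_t)t * 0x01010101u;  /* replicate threshold into 4 bytes */
    return pix & emu_xpnd4(emu_cmpgtu4(pix, t4));
}
```

The multiply by 0x01010101 replicates the 8-bit threshold into all four byte lanes, matching the replication step the kernel performs once before its loop.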
  • Page 344: Clear Below Threshold Kernel, Using _Cmpgtu4 And _Xpnd4 Intrinsics

    Packed-Data Processing on the ’C64x Example 8–17. Clear Below Threshold Kernel, Using _cmpgtu4 and _xpnd4 Intrinsics void clear_below_thresh(unsigned char *restrict image, int count, unsigned char threshold) int i; unsigned t3_t2_t1_t0; /* Threshold (replicated) unsigned p7_p6_p5_p4, p3_p2_p1_p0; /* Pixels unsigned c7_c6_c5_c4, c3_c2_c1_c0; /* Comparison results unsigned x7_x6_x5_x4, x3_x2_x1_x0;...
  • Page 345: Linear Assembly Considerations

Linear Assembly Considerations 8.3 Linear Assembly Considerations The ’C64x supports linear assembly programming via the C6000 Assembly Optimizer. The operation of the Assembly Optimizer is described in detail in the Optimizing C/C++ Compiler User’s Guide. This section covers ’C64x-specific aspects of linear assembly programming.
  • Page 346: Loop Trip Count In C

    Linear Assembly Considerations Example 8–18. Loop Trip Count in C int count_loop_iterations(int count) int iters, i; iters = 0; for (i = count; i > 0; i––) iters++; return iters; Without BDEC and BPOS, this loop would be written as shown in Example 8–19 below.
  • Page 347: Loop Trip Count Using Bdec

    Linear Assembly Considerations Example 8–20. Loop Trip Count Using BDEC .global _count_loop_iterations _count_loop_iterations .cproc count .reg i, iters ZERO iters ; Initialize our return value to 0. count, ; i = count – 1; BDEC loop, ; Do not iterate if count < 1. does_not_iterate: .return iters ;...
  • Page 348: Using The .Call Directive In Linear Assembly

Linear Assembly. However, Linear Assembly provides the function call directive, .call, and this directive makes use of ADDKPC. The .call directive is explained in detail in the TMS320C6000 Optimizing C/C++ Compiler User’s Guide. Example 8–22 illustrates a simple use of the .call directive. The Assembly Optimizer issues an ADDKPC as part of the function call sequence for this .call,
  • Page 349: Avoiding Cross Path Stalls

Most ’C64x implementations will have a different memory bank structure than existing ’C62x implementations in order to support the wider memory accesses that the ’C64x provides. Refer to the TMS320C6000 Peripherals Reference Guide (SPRU190) for specific information on the part that you are using.
  • Page 350: C64X Data Cross Paths

This is known as a cross path stall. This stall is inserted automatically by the hardware; no NOP instruction is needed. For more information, see the TMS320C6000 CPU and Instruction Set Reference Guide (SPRU189). This cross path stall does not occur on the ’C62x/’C67x.
  • Page 351: Avoiding Cross Path Stalls: Weighted Vector Sum Example

With appropriate scheduling, the ’C64x can provide one cross path operand per data path per clock cycle with no stalls. In many cases, the TMS320C6000 Optimizing C Compiler and Assembly Optimizer automatically perform this scheduling as demonstrated in Example 8–24.
  • Page 352: Avoiding Cross Path Stalls: Partitioned Linear Assembly

    Linear Assembly Considerations Example 8–25. Avoiding Cross Path Stalls: Partitioned Linear Assembly .global _w_vec _w_vec: .cproc a, b, c, m .reg ai_i1, bi_i1, pi, pi1, pi_i1, pi_s, pi1_s .reg mask, bi, bi1, ci, ci1, c1, cntr –1, mask MVKH 0, mask ;...
  • Page 353: Avoiding Cross Path Stalls: Vector Sum Loop Kernel

Linear Assembly Considerations The code above is sent to the assembly optimizer with the following compiler options: –o3, –mi, –mt, –k, and –mg. Since a specific C6000 platform was not specified, the default is to generate code for the ’C62x. The –o3 option enables the highest level of the optimizer.
  • Page 354: Avoiding Cross Path Stalls: Assembly Output Generated For Weighted Vector Sum Loop Kernel

    Linear Assembly Considerations In Example 8–27 below, the assembly output generated by the assembly opti- mizer for the weighted vector sum loop kernel compiled with the –mv6400 –o3 –mt –mi –k –mg options: Example 8–27. Avoiding Cross Path Stalls: Assembly Output Generated for Weighted Vector Sum Loop Kernel LOOP: ;...
  • Page 355: Feedbacksolutions

    Appendix A Feedback Solutions This appendix is provided as a quick reference to techniques that can be used to optimize loops; in most cases it refers you to specific sections within this book that provide additional detail.
  • Page 356: A.1 Loop Disqualification Messages

    Solution If the caller and the callee are C or C++, use –pm and –op2. See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on the correct usage of –op2. Do not use –oi0, which disables automatic inlining.
  • Page 357 Loop Carried Dependency Bound Too Large If the loop has complex loop control, try –mh according to the recommenda- tions in the TMS320C6000 Optimizing C/C++ Compiler User’s Guide. Cannot Identify Trip Counter The loop control is too complex. Try to simplify the loop.
  • Page 358: A.2 Pipeline Failure Messages

    A.2 Pipeline Failure Messages Address Increment Too Large Description One thing the compiler does when software pipelining is to allow reordering of all loads and stores occurring from the same array or pointer. This allows for maximum flexibility in scheduling. Once a schedule is found, the compiler returns and adds the appropriate offsets and increments/decrements to each load and store.
  • Page 359 For More Information... See section 6.9, Loop Unrolling (in Assembly) , on page 6-94. See section 3.4.3.4, Loop Unrolling (in C) , on page 3-44. TMS320C6000 C/C++ Compiler User’s Guide Cycle Count Too High. Not Profitable Description In rare cases, the iteration interval of a software pipelined loop is higher than a non-pipelined list scheduled loop.
  • Page 360 Pipeline Failure Messages Did Not Find Schedule Description Sometimes, due to a complex loop or schedule, the compiler simply cannot find a valid software pipeline schedule at a particular iteration interval. Solution Split into multiple loops or reduce the complexity of the loop if possible. Unpartition/repartition the linear assembly source code.
  • Page 361 Pipeline Failure Messages Iterations in Parallel > Min. Trip Count Description Based on the available information on the minimum trip count, it is not always safe to execute the pipelined version of the loop. Normally, a redundant loop would be generated. However, in this case, redundant loop generation has been suppressed via the –ms0/–ms1 option.
  • Page 362 Pipeline Failure Messages Solution Write linear assembly and insert MV instructions to split register lifetimes that are live–too–long. For more information... See section 6.10.4.1, Split–Join–Path Problems , on page 6-104. Too Many Predicates Live on One Side Description The C6000 has predicate, or conditional, registers available for use with condi- tional instructions.
  • Page 363 Pipeline Failure Messages Solution Split into multiple loops or reduce the complexity of the loop if possible. Unpartition/repartition the linear assembly source code. Probably best modified by another technique (i.e. loop unrolling). Modify the register and/or partition constraints in linear assembly. For more information...
  • Page 364: A.3 Investigative Feedback

    A.3 Investigative Feedback Loop Carried Dependency Bound is Much Larger Than Unpartitioned Resource Bound Description If the loop carried dependency bound is much larger than the unpartitioned resource bound, this can be an indicator of a potential memory alias disambiguation problem.
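In C, the usual way to break a suspected memory alias is the C99 restrict qualifier, which tells the compiler the pointers never overlap so loads from later iterations can be moved above earlier stores. A sketch with illustrative names:

```c
/* Without restrict, the compiler must assume c may alias a or b and
   serialize each store before the next loads, inflating the loop
   carried dependency bound; restrict removes that assumption. */
void vecsum_restrict(const short *restrict a, const short *restrict b,
                     short *restrict c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```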
  • Page 365 Investigative Feedback Two Loops are Generated, One Not Software Pipelined Description If the trip count is too low, it is illegal to execute the software pipelined version of the loop. In this case, the compiler could not guarantee that the minimum trip count would be high enough to always safely execute the pipelined ver- sion.
  • Page 366 Investigative Feedback Larger Outer Loop Overhead in Nested Loop Description In cases where the inner loop count of a nested loop is relatively small, the time to execute the outer loop can start to become a large percentage of the total execution time.
  • Page 367 Investigative Feedback T Address Paths Are Resource Bound Description T address paths defined the number of memory accesses that must be sent out on the address bus each loop iteration. If these are the resource bound for the loop, it is often possible to reduce the number of accesses by performing word accesses (LDW/STW) for any short accesses being performed.
  • Page 368 Index Index linear dot product, fixed-point, 6-10, 6-20, 6-24, 6-30, 6-39 dot product, floating-point, 6-21, 6-25, 6-31, _add2 intrinsic, 3-25 6-40 aliasing, 3-9 FIR filter, 6-113, 6-115, 6-124, 6-126 FIR filter, outer loop, 6-139 allocating resources FIR filter, outer loop conditionally executed conflicts, 6-65 with inner loop, 6-142, 6-144 dot product, 6-23...
  • Page 369 Index floating-point, 6-20 FIR filter, 3-29, 3-47, 4-4, 6-111, 6-123 inner loop completely unrolled, 3-48 .D functional units, 5-7 optimized form, 3-30 data types, 3-2 unrolled, 6-132, 6-137, 6-140 with redundant load elimination, 6-112 dependency graph if-then-else, 6-87, 6-95 dot product, fixed-point, 6-12 IIR filter, 6-78 dot product, fixed-point live-too-long, 6-102...
  • Page 370 Index linear assembly for inner loop with LDW and linear assembly allocated resources, 6-24 for inner loop, 6-113 nonparallel assembly code, 6-14 for outer loop, 6-139 parallel assembly code, 6-15 for unrolled inner loop, 6-124 floating-point for unrolled inner loop with .mptr directive, assembly code with LDW before software pi- 6-126 pelining, 6-27...
  • Page 371 Index with hand-coded assembly, 7-12 in writing parallel code, 6-11 live-too-long resolution, 6-107 with the C compiler, 7-11 weighted vector sum, 6-62 interrupts little-endian mode, and MPY operation, 6-21 overview, 7-2 single assignment versus multiple assignment, live-too-long 7-3–7-4 code, 6-68 C code, 6-102 intrinsics inserting move (MV) instructions, 6-106...
  • Page 372 Index move (MV) instruction, 6-106 program-level optimization, 3-7 _mpy intrinsic, 3-28 prolog, 3-41, 6-51, 6-53 _mpyh ( ) intrinsic, 3-28 pseudo-code, for single-cycle accumulator with ADDSP, 6-37 _mpyhl intrinsic, 3-25 _mpylh intrinsic, 3-25 multicycle instruction, staggered accumulation, 6-37 multiple assignment, code example, 7-3 MUST_ITERATE, 3-25 redundant load elimination, 6-111...
  • Page 373 Index C code, 3-8 with const keywords, _nassert, word reads, 3-25 techniques with const keywords, _nassert, word reads, for priming the loop, 6-51 and loop unrolling, 3-46 for refining C code, 3-16 with const keywords,_nassert, and word reads for removing extra instructions, 6-45, 6-55 (generic), 3-26, 3-27 using intrinsics, 3-16 with three memory operations, 3-45...
