Infineon Technologies TriCore Compiler User Manual

User's Manual, V1.4, December 2003

TriCore™ Compiler Writer's Guide

32-bit Unified Processor

Microcontrollers

Never stop thinking.

Summary of Contents for Infineon Technologies TriCore Compiler

Page 1 User’s Manual, V1.4, December 2003 TriCore™ Compiler Writer’...
Page 2 Infineon Technologies Components may only be used in life-support devices or systems with the express written approval of Infineon Technologies, if a failure of such components can reasonably be expected to cause the failure of that life-support device or system, or to affect the safety or effectiveness of that device or system. Life support devices or systems are intended to be implanted in the human body, or to support and/or maintain and sustain and/or protect human life.
Page 3 User’s Manual, V1.4, December 2003 TriCore™ Compiler Writer’...
Page 4 TriCore™ Compiler Writer’s Guide Revision History: 2003-12 V1.4 Previous Version: Page Subjects (major changes since last revision) Updated to include TriCore 2 V1.2 Comparisons between Rider A and Rider B removed V1.3 TC2 References V1.4 Sections 3.3 to 3.10 revised. New FPU sections 1.5 and 2.1.2.1 added.
Page 5: Table Of Contents
TriCore 32-bit Unified Processor Compiler Writer’s Guide Table of Contents: Preface (page vii); Optimization Strategies...
Page 6 TriCore 32-bit Unified Processor Compiler Writer’s Guide Table of Contents: 2.1.2.1 Floating Point Unit (FPU) Pipeline (page 33); 2.1.2.2 TriCore 1.2/1.3 Regular Integer versus MAC Pipelines...
Page 7: Preface
TriCore 32-bit Unified Processor Compiler Writer’s Guide Preface This document is intended as a supplement to the TriCore architecture manual, for use by compiler writers and code generation tool vendors. It presumes a degree of familiarity with the architecture manual, although it can be read by experienced compiler writers as an “inside introduction”...
Page 8: Optimization Strategies
TriCore 32-bit Unified Processor Compiler Writer’s Guide Optimization Strategies Most of the optimization strategies described in this chapter are equally applicable to the 1.2, 1.3 and 2.0 implementations of the TriCore Instruction Set Architecture (ISA). Where differences exist, they are noted. Using 16-bit Instructions To achieve high performance with high code density, the TriCore architecture supports both 16-bit and 32-bit instruction sizes.
Page 9 TriCore 32-bit Unified Processor Compiler Writer’s Guide A reordering of operations might be able to remove an interference edge between the result and one of its operands, allowing a short form dyadic instruction to be used. Interference, in general, is dependent on the order of operations. In the canonical order from which the interference graph is typically built, the last use of an input operand might follow the operation, creating interference between the operand and the result.
Page 10: 16-Bit Loads And Stores
TriCore 32-bit Unified Processor Compiler Writer’s Guide 1.1.2 16-bit Loads and Stores The TriCore architecture includes short instruction forms for all of the most frequently used types of Load and Store instructions: • LD.BU, LD.H, LD.W, LD.A • ST.B, ST.H, ST.W, ST.A Each of these instructions has variants for four different addressing modes: •...
Page 11: Other Implicit D15 Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide This means that the data element (in the case of D15) or the address expression (in the case of A15) should have a lifetime that interferes as little as possible with other candidate uses of the register. That usually means a short lifetime. An easy example is a pointer which is loaded and then used to access a record field, after which it has no further uses.
Page 12: Instructions With 8-Bit Constants
TriCore 32-bit Unified Processor Compiler Writer’s Guide 1.1.4.2 Instructions with 8-bit Constants D15 is also the implicit destination register for a variant of MOV whose source operand is a zero-extended 8-bit constant. It is the implicit source and destination for a variant of OR with a zero-extended 8-bit constant as its right operand.
Page 13: Auto-Adjust Addressing
TriCore 32-bit Unified Processor Compiler Writer’s Guide Not included in this section are the bit instructions (bit load and store, bit move, bit logical and jump on bit). To use these instructions effectively a number of special considerations need to be noted, and they are covered later in a separate section. 1.2.1 Auto-adjust Addressing Two of the addressing modes available for all 32-bit Load and Store instructions are pre-...
Page 14: Circular Addressing
TriCore 32-bit Unified Processor Compiler Writer’s Guide 1.2.2 Circular Addressing Although the TriCore architecture supports circular buffer addressing for efficient real-time data processing and DSP applications, it is decidedly non-trivial for a compiler to recognize, from generic C source code, when use of circular addressing is appropriate. The problem is that there are many ways that the effect of circular addressing can be expressed in the source code, and it requires specialized and fairly elaborate analysis to be able to recognize them.
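To illustrate why recognition is hard, here is a minimal sketch of three source-level forms that all advance an index around a circular buffer. The buffer length `BUF_LEN` and the function names are hypothetical and chosen only for this illustration; they do not come from the guide.

```c
#include <stddef.h>

#define BUF_LEN 8u  /* hypothetical buffer length (a power of two) */

/* Form 1: wrap with the modulo operator. */
size_t next_index_mod(size_t i)
{
    return (i + 1u) % BUF_LEN;
}

/* Form 2: wrap with an explicit comparison. */
size_t next_index_cmp(size_t i)
{
    i = i + 1u;
    if (i == BUF_LEN) {
        i = 0u;
    }
    return i;
}

/* Form 3: wrap with a mask; only valid when BUF_LEN is a power of two. */
size_t next_index_mask(size_t i)
{
    return (i + 1u) & (BUF_LEN - 1u);
}
```

All three compute the same successor index for any `i` below `BUF_LEN`, yet each presents a different expression tree to the compiler, which is the recognition problem the text describes.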
Page 15: Min And Max
TriCore 32-bit Unified Processor Compiler Writer’s Guide Replacing an actual call to the abs() library function with the ABS instruction is problematic. In theory the user should be able to shadow the standard library with an abs() function that does something different from (or in addition to) what is expected. It might, for example, create a histogram of the input arguments for analysis purposes.
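Open-coded conditional expressions do not involve a library call, so they are natural candidates for pattern matching. The following is a sketch of such idioms; the function names are hypothetical, and whether a given compiler maps them to ABS, MIN and MAX is an assumption, not something the guide states for these exact forms.

```c
/* Open-coded absolute value: a candidate for the ABS instruction.
   Note: -x is undefined in C when x is INT_MIN. */
int abs_idiom(int x)
{
    return (x < 0) ? -x : x;
}

/* Open-coded minimum: a candidate for the MIN instruction. */
int min_idiom(int a, int b)
{
    return (a < b) ? a : b;
}

/* Open-coded maximum: a candidate for the MAX instruction. */
int max_idiom(int a, int b)
{
    return (a > b) ? a : b;
}
```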
Page 16: Conditional Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide 1.2.8 Conditional Instructions There are a limited number of conditional instructions in the TriCore instruction set. They include conditional move (CMOV and CMOVN), select (SEL and SELN), conditional add (CADD and CADDN) and conditional subtract (CSUB and CSUBN). The control condition is the true (non-zero) or false (zero) status of a specified data register (The 'opN' forms perform op when the control condition is false).
Page 17: Extending The Use Of Conditional Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide “!(<expr>)”, and allows <expr> to be used directly as the control expression for the complement conditional form. 1.2.9 Extending the Use of Conditional Instructions Although the TriCore architecture does not support the conditional execution of most instructions, it is frequently possible to combine one or more unconditional instructions, followed by a SEL or a CMOV to avoid branching.
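A minimal sketch of the idea at source level: the guarded operation is executed unconditionally into a temporary, and a select then keeps or discards the result. The function names are hypothetical, and unsigned arithmetic is used so that the speculated add is always well defined in C.

```c
/* Branching form: the add executes only when c is non-zero. */
unsigned update_branch(unsigned c, unsigned x, unsigned a, unsigned b)
{
    if (c) {
        x = a + b;
    }
    return x;
}

/* Branch-free form: the add always executes, then a select
   (the role played by SEL or CMOV) keeps or discards it. */
unsigned update_select(unsigned c, unsigned x, unsigned a, unsigned b)
{
    unsigned t = a + b;   /* unconditional operation into a temporary */
    return c ? t : x;     /* choice controlled by the zero/non-zero value of c */
}
```

The transformation is only safe when the unconditional operation has no side effects and cannot trap, which is why it suits simple arithmetic better than loads or stores.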
Page 18: Accumulating Compare Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide The latter would arguably be a more intuitive definition, but its implementation would require either four parallel operand reads in one cycle, or a true “conditional write” to the destination register. Neither solution is really practical. The problem with the former is obvious.
Page 19: Using Bit Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide The optimal TriCore code to translate this rather complex macro is only six instructions. Assuming A is in D4, B is in D5, and the result is returned in D2, the assembly code would D2,D4,0 AND.LT D2,D5,0...
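The behaviour of an accumulating compare such as AND.LT can be modelled in C roughly as follows. This is a sketch under the assumption that the instruction combines bit 0 of the destination with the comparison result and leaves the upper bits unchanged; the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed model of AND.LT: bit 0 of acc is ANDed with (a < k);
   the remaining bits of acc pass through unchanged. */
uint32_t and_lt(uint32_t acc, int32_t a, int32_t k)
{
    uint32_t cmp = (a < k) ? 1u : 0u;
    return (acc & ~1u) | (acc & cmp);
}

/* (a < 0) && (b < 0) evaluated without branches. */
uint32_t both_negative(int32_t a, int32_t b)
{
    uint32_t r = (a < 0) ? 1u : 0u;  /* plain compare sets the initial result */
    return and_lt(r, b, 0);          /* second compare accumulates into it */
}
```

Because both comparisons are free of side effects, evaluating the second one unconditionally gives the same result as the short-circuit `&&` in the source.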
Page 20: Template Matching For Extract
TriCore 32-bit Unified Processor Compiler Writer’s Guide individual bits within a word, to perform logical operations on individual bits, and to conditionally branch on individual bit status. Packing a large number of discrete boolean variables into a single word held in a register can be particularly helpful for code that implements complex finite state machines for control applications.
Page 21: Template Matching For Insert
TriCore 32-bit Unified Processor Compiler Writer’s Guide An IR expression whose root operator is ’&’, with left or right operand sub expression a -1) << p)), literal of either form; i.e. can be replaced by an intrinsic function call for the EXTR.U instruction. If the other operand sub expression has a root operator of >>, then the shift amount for that sub expression can be added to the p argument of the intrinsic function call, and the >>...
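As an illustration of the extract template, the sketch below models an unsigned extract as a C function and shows two source patterns that reduce to it. The helper `extr_u` and its (p, w) argument order are hypothetical stand-ins for the intrinsic the text refers to, not its actual interface.

```c
#include <stdint.h>

/* Hypothetical model of an unsigned extract: w bits starting at bit p,
   zero-extended. Requires w >= 1 and p + w <= 32. */
uint32_t extr_u(uint32_t x, unsigned p, unsigned w)
{
    uint32_t mask = (w >= 32u) ? 0xFFFFFFFFu : ((1u << w) - 1u);
    return (x >> p) & mask;
}

/* Source pattern 1: shift right, then mask with a low-order literal. */
uint32_t field_shift_mask(uint32_t x)
{
    return (x >> 4) & 0xFFu;
}

/* Source pattern 2: mask in place with a shifted literal, then shift right. */
uint32_t field_mask_shift(uint32_t x)
{
    return (x & (0xFFu << 4)) >> 4;
}
```

Both patterns are equivalent to `extr_u(x, 4, 8)`; in the first, the shift amount of the `>>` operand is folded into the position argument, as the text describes.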
Page 22: Bit-Logical Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide 1.3.3 Bit-Logical Instructions The bit-logical instructions include the simple bit-logical instructions and the so-called accumulating bit-logical instructions. They are mainly intended to facilitate programming of finite state machines for control applications. The simple bit-logical instructions (AND.T, NAND.T, OR.T, NOR.T, XOR.T, XNOR.T, ANDN.T and ORN.T) take two bit values as input operands, and return a TRUE / FALSE (1 / 0) result in the destination register.
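The simple bit-logical operations can be modelled in C as functions of two (word, bit position) pairs. This is a sketch for illustration; the function names and argument order are hypothetical and do not reflect the assembler operand syntax.

```c
#include <stdint.h>

/* Reads bit `pos` of `word` as 0 or 1. */
static uint32_t bit_of(uint32_t word, unsigned pos)
{
    return (word >> pos) & 1u;
}

/* Each model takes two single-bit inputs and returns a 1 / 0 result. */
uint32_t and_t(uint32_t a, unsigned pa, uint32_t b, unsigned pb)
{
    return bit_of(a, pa) & bit_of(b, pb);
}

uint32_t or_t(uint32_t a, unsigned pa, uint32_t b, unsigned pb)
{
    return bit_of(a, pa) | bit_of(b, pb);
}

uint32_t xor_t(uint32_t a, unsigned pa, uint32_t b, unsigned pb)
{
    return bit_of(a, pa) ^ bit_of(b, pb);
}

/* AND with the complement of the second bit. */
uint32_t andn_t(uint32_t a, unsigned pa, uint32_t b, unsigned pb)
{
    return bit_of(a, pa) & (bit_of(b, pb) ^ 1u);
}
```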
Page 23: Call Related Optimizations
TriCore 32-bit Unified Processor Compiler Writer’s Guide To update the register copy of such a variable, it is normally necessary to use the INS.T instruction to move the source bit value into the target bit location. To update the memory copy, the register copy of the word containing the variable can be stored directly, provided that it still holds currently valid copies of all the other variables in the word;...
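Updating one packed boolean in the register copy amounts to a single-bit insert. A minimal C model of that operation follows; the function name and parameter order are hypothetical and are not the INS.T operand syntax.

```c
#include <stdint.h>

/* Copies bit `spos` of `source` into bit `tpos` of `target`,
   leaving every other bit of `target` unchanged. */
uint32_t insert_bit(uint32_t target, unsigned tpos, uint32_t source, unsigned spos)
{
    uint32_t bit = (source >> spos) & 1u;
    return (target & ~(1u << tpos)) | (bit << tpos);
}
```

Because the other bits pass through untouched, the updated word can be stored back to memory as a whole only while those bits are still valid copies of their variables, which is the condition the text attaches to the direct store.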
Page 24: Tail Call Conversion
TriCore 32-bit Unified Processor Compiler Writer’s Guide For a RET (Return from CALL), it specifies how much of the alternate bank must be reloaded from its CSA (Context Save Area) memory before switching banks and returning to the caller. A quarter context save or restore takes only three cycles rather than nine, in those cases where an actual memory transfer is required.
Page 25: QSEED FPU (Floating Point Unit) Instruction
TriCore 32-bit Unified Processor Compiler Writer’s Guide making tail calls to recursive functions. Again, however, it complicates debugging, and should probably be suppressed when the user compiles with the “-g” flag. QSEED FPU (Floating Point Unit) Instruction TriCore 1.2/1.3 can be additionally configured with a single precision Floating Point Unit (FPU) via a co-processor interface that enables significant acceleration of single precision floating point operations over that of emulation libraries.
Page 26: Miscellaneous Considerations
TriCore 32-bit Unified Processor Compiler Writer’s Guide Miscellaneous Considerations 1.6.1 LOOP Instructions The LOOP instruction provides zero-overhead looping for counted loops. Placed at the end of the loop, its first execution initializes a loop cache entry. From then on the loop cache entry monitors the instruction fetch address, and executes the loop instruction when it sees its address coming up in the fetch address stream.
Page 27: Shift Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide implementation strategy incurs a code space penalty of one word for every load of a given literal, after the first load, but leads to more predictable execution times for system configurations involving cached DRAM. The latter strategy is more efficient for configurations with SRAM data memory.
Page 28 TriCore 32-bit Unified Processor Compiler Writer’s Guide break; default: The translated code, using the compact code strategy described, would be: <evaluate (c) into d15> .red_test: jne16 d15,0,.green_test ; code for 'case red' j .continue .green_test: jne16 d15,1,.blue_test ; code for 'case green' j .continue .blue_test: jne16...
Page 29 TriCore 32-bit Unified Processor Compiler Writer’s Guide Note the code movement for the default case, allowing fall-through from the last case test. With this approach the branches have greater spans, and are less likely to resolve to 16-bit forms. The approach therefore involves a degree of space / time trade-off. For switch statements with many arms, neither of the above approaches is particularly attractive.
Page 30: Implementation Information
TriCore 32-bit Unified Processor Compiler Writer’s Guide Implementation Information Pipeline Model and Instruction Scheduling This section provides an overview of the pipeline models for both the TriCore 1.2/1.3 and the TriCore 2.0 implementations of the architecture. The purpose is to enable compiler writers to construct models sufficient for use by an instruction scheduler.
Page 31: Simple Pipeline Model For TriCore 1.2/1.3 And TriCore 2.0
TriCore 32-bit Unified Processor Compiler Writer’s Guide 2.1.1.1 Simple Pipeline Model for TriCore 1.2/1.3 and TriCore 2.0 TriCore 1.2/1.3 and TriCore 2.0 implementations have two regular instruction pipelines, together with a special zero-overhead loop instruction cache. The two regular pipelines are referred to as the Integer pipeline, and the Load-Store (LS) pipeline.
Page 32: Simple Scheduling Strategy
TriCore 32-bit Unified Processor Compiler Writer’s Guide Dual pipeline instructions cannot issue in parallel with either integer pipeline or LS pipeline instructions. There are relatively few instructions in this category. ADDSC.A, MOV.A, MOV.D, and the address compare instructions (EQ.A, etc.) are the only ALU- style instructions.
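The issue rules can be turned into a very small cost model that a scheduler could use to compare candidate orderings. The sketch below counts issue cycles for a dependency-free instruction sequence. It assumes that an integer pipeline instruction pairs only with an LS pipeline instruction that immediately follows it; that ordering rule, the type names and the function name are assumptions made for this illustration, and latencies are ignored.

```c
#include <stddef.h>

/* Pipeline class of an instruction (hypothetical names). */
typedef enum { PIPE_INT, PIPE_LS, PIPE_DUAL } pipe_class;

/* Counts issue cycles for a sequence with no data dependencies.
   Assumed rule: an integer instruction immediately followed by an
   LS instruction issues together with it; everything else, including
   every dual pipeline instruction, issues alone. */
size_t issue_cycles(const pipe_class *seq, size_t n)
{
    size_t cycles = 0;
    size_t i = 0;

    while (i < n) {
        if (seq[i] == PIPE_INT && i + 1 < n && seq[i + 1] == PIPE_LS) {
            i += 2;   /* integer + LS pair issues in one cycle */
        } else {
            i += 1;   /* single issue */
        }
        cycles++;
    }
    return cycles;
}
```

Under this model the interleaved order INT, LS, INT, LS takes two cycles while INT, INT, LS, LS takes three, which is the kind of difference a scheduler tries to exploit.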
Page 33: Latency Considerations
TriCore 32-bit Unified Processor Compiler Writer’s Guide The relatively high priority given to selection of dual pipeline instructions is intended to retire them early, when there is no penalty, so that they unblock dependent LS instructions and enable more pairing of integer and LS pipeline instructions. In practice they will have little impact on scheduling.
Page 34: TriCore 1.2/1.3 Regular Integer Versus MAC Pipelines
TriCore 32-bit Unified Processor Compiler Writer’s Guide In TriCore 2.0 the situation is different in that the FPU has been integrated into the core and behaves as if it is part of the IP pipeline. The compiler writer should therefore schedule FPU instructions as multicycle IP instructions, which are described in the following sections.
Page 35: TriCore 1.2/1.3 Regular Integer Versus MAC Pipelines
TriCore 32-bit Unified Processor Compiler Writer’s Guide dependence is on the accumulator input. However, if it uses the accumulator result from the preceding operation as a multiplier input, the second MAC operation will stall for one cycle. 2.1.2.3 TriCore 1.2/1.3 Regular Integer versus MAC Pipelines In TriCore 2.0 implementations there is no writeback contention between regular IP instructions and MAC operations.
Page 36: TriCore 1.2/1.3 Definition-To-Store Latencies
TriCore 32-bit Unified Processor Compiler Writer’s Guide calculation. That is because effective addresses are computed in the decode stage, rather than the execute stage. The compiler should always try to schedule a non-dependent operation into that delay slot. 2.1.2.5 TriCore 1.2/1.3 Definition-to-Store Latencies For stores, the computed EA (Effective Address) is staged for one or more cycles, before being presented to the DMU (Data Memory Unit) along with the store data.
Page 37: TriCore 2.0 Load-To-Use Latencies
TriCore 32-bit Unified Processor Compiler Writer’s Guide 2.1.2.6 TriCore 2.0 Load-to-Use Latencies Table 5 describes the stages of the TC2 Load-Store pipeline, starting from the decode stage. Table 5 TriCore 2.0 Load-Store Pipeline Stages (Name: Activities). Decode: Instruction decode and register access. Execute-1: Effective Address (EA) calculation and MMU translation. Execute-2...
Page 38: Multi-Cycle Instructions
TriCore 32-bit Unified Processor Compiler Writer’s Guide 2.1.2.8 Multi-Cycle Instructions Some TriCore instructions are multi-cycle instructions. This means that they consume multiple instruction issue cycles. They differ in that respect from the 16-bit MAC instructions, which use an added pipeline stage but only one issue cycle. The multi-cycle instructions effectively reissue themselves one or more times.
Page 39: Block Ordering And Alignment
TriCore 32-bit Unified Processor Compiler Writer’s Guide 2.1.3 Block Ordering and Alignment The previous section dealt with issues that affect the scheduling of code within basic blocks. It described the pipeline model, as it appears when instruction issue is not limited by the availability of code in the instruction fetch buffers.
Page 40: TriCore 2.0 Branch Timings
TriCore 32-bit Unified Processor Compiler Writer’s Guide conditional, the actual branch direction is resolved. If the branch direction is contrary to the prediction, then the correct target address is sent to the fetch unit at the end of the execute cycle. The decode cycle for the first instruction at the branch target address follows the execute cycle for the branch instruction.
Page 41 TriCore 32-bit Unified Processor Compiler Writer’s Guide rise to the minimum number of unconditional branches required for control flow connectivity. However, given information on individual branch probabilities, performance can usually be improved through informed ordering of the blocks. Determination of an optimal block ordering is, in principle, an NP-complete problem.
Page 42: Pipeline Balancing
TriCore 32-bit Unified Processor Compiler Writer’s Guide Figure 1 Control Flow Subgraph for if Statement Pipeline Balancing When the initial translation for a block contains unused integer or LS (Load/Store) pipeline issue slots (i.e. the block contains predominantly one type of instruction or the other, rather than a balanced mix that can be paired for parallel issue), there are various transformations that the compiler can apply to improve the balance and reduce overall execution time.
Page 43 TriCore 32-bit Unified Processor Compiler Writer’s Guide have no dependencies on operations within the block from which they are moved, and they cannot be Loads or Stores involving variables declared volatile. Here it is assumed that the predecessor block has multiple successors (commonly two), and that the result of the promoted operation is not used in every successor.
Page 44: Advanced Optimizations
So a TriCore compiler must either perform early analysis to completely avoid the expensive intermediate steps on loops that are not parallelizable, or apply the appropriate inverse transformations to the non-parallelizable loops.
Page 45: Data Dependence Analysis
TriCore 32-bit Unified Processor Compiler Writer’s Guide Data Dependence Analysis C programs impose dependencies that require variables (scalar and array) to be written and read in a particular order. As long as the order of the reads and writes on these variables is preserved, the rest of the operations can be executed in any order, or even concurrently.
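For the common case of a loop body of the form `A[i + w] = ... A[i + r] ...`, the ordering constraint can be classified from the two constant offsets. The sketch below is a simplified illustration of that test for a unit-stride ascending loop; the type and function names are hypothetical, and a real analysis must also handle non-constant subscripts and aliasing.

```c
/* Kind of ordering constraint between the write and the read. */
typedef enum { DEP_NONE_CARRIED, DEP_FLOW, DEP_ANTI } dep_kind;

/* Classifies the loop-carried dependence of
       for (i = 0; i < trip; i++) A[i + w] = f(A[i + r]);
   Iteration i writes element i + w and iteration j reads element j + r,
   so they touch the same element when j - i == w - r. */
dep_kind classify_dependence(int w, int r, int trip)
{
    int d = w - r;

    if (d == 0 || d >= trip || -d >= trip) {
        return DEP_NONE_CARRIED;  /* same iteration, or never both in range */
    }
    if (d > 0) {
        return DEP_FLOW;          /* written first, read d iterations later */
    }
    return DEP_ANTI;              /* read first, overwritten -d iterations later */
}
```

For example, `A[i + 1] = A[i]` carries a flow dependence of distance 1, whereas `A[i] = A[i + 1]` carries an anti-dependence.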
Page 46: Forall Loops
TriCore 32-bit Unified Processor Compiler Writer’s Guide FORALL Loops FORALL is a loop construct in the Fortran 90 standard. The for and FORALL Loop Order of Execution table below illustrates the differences in the order of execution of the C for loop and the Fortran 90 FORALL loop. •...
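The difference in execution order can be made concrete with a small C sketch. The second function emulates FORALL semantics by evaluating every right-hand side before performing any assignment; the array length `N` and the function names are hypothetical.

```c
#define N 5  /* hypothetical array length */

/* C for-loop order: each iteration sees the results of earlier iterations. */
void for_order(int a[N])
{
    for (int i = 1; i < N; i++) {
        a[i] = a[i - 1] + 1;
    }
}

/* FORALL order, emulated in C: all right-hand sides are evaluated
   from the original values, then all assignments are performed. */
void forall_order(int a[N])
{
    int rhs[N];

    for (int i = 1; i < N; i++) {
        rhs[i] = a[i - 1] + 1;
    }
    for (int i = 1; i < N; i++) {
        a[i] = rhs[i];
    }
}
```

Starting from {0, 10, 20, 30, 40}, the for loop yields {0, 1, 2, 3, 4} while the FORALL order yields {0, 1, 11, 21, 31}. The two agree only when the loop carries no dependence, which is the condition for converting a for loop to FORALL.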
Page 47: Strip-Mining
The performance of this loop can be improved by a factor of two or four, depending on whether the data elements are half-words or bytes, respectively. For this loop the TriCore compiler should be capable of performing strip-mining and then generating packed data instructions.
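A minimal sketch of strip-mining at source level, assuming 16-bit elements so that two of them fit in one 32-bit word. The strip size, function name and remainder handling are choices made for this illustration rather than the guide's own code.

```c
#include <stdint.h>

#define STRIP 2  /* two 16-bit elements per 32-bit word */

/* Element-wise add of half-word arrays, strip-mined by STRIP. */
void add_halfwords(int16_t *a, const int16_t *b, const int16_t *c, int n)
{
    int i = 0;

    /* Main loop: each trip handles one full strip, which is the unit
       a single packed half-word add could process. */
    for (; i + STRIP <= n; i += STRIP) {
        a[i]     = (int16_t)(b[i]     + c[i]);
        a[i + 1] = (int16_t)(b[i + 1] + c[i + 1]);
    }

    /* Remainder loop: leftover elements when n is not a multiple of STRIP. */
    for (; i < n; i++) {
        a[i] = (int16_t)(b[i] + c[i]);
    }
}
```

With byte elements the same structure applies with a strip size of four, which is where the factor of four in the text comes from.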
Page 48: Scalar Expansion & Loop Peeling
TriCore 32-bit Unified Processor Compiler Writer’s Guide for (IS = X; IS < N; IS=IS+S) { A[IS:IS+S-1:1] = B[IS:IS+S-1:1] + C[IS:IS+S-1:1]; Figure 4 Triplet Notation Code for Parallel Section in the Loop of Figure 3 for (IS = X; IS < N; IS=IS+S) { LD.W D0, B[I];...
Page 49 TriCore 32-bit Unified Processor Compiler Writer’s Guide In some cases the scalar is used in the loop before it is assigned; applying loop peeling then adjusts the iterations so that the scalar is assigned before it is used. Consider the following example: There is a loop-carried anti-dependence due to the scalar T and the loop cannot be converted to a FORALL loop.
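The guide's own example is not reproduced in this excerpt, so the following is a hypothetical loop of the same shape, shown before and after peeling. The function names and array roles are assumptions, and the arrays are assumed not to overlap.

```c
/* Original shape: T is used first, then assigned, in every iteration. */
void use_then_assign(int *a, const int *b, const int *c, int n, int *t)
{
    for (int i = 0; i < n; i++) {
        a[i] = *t + b[i];   /* use of T from the previous iteration */
        *t = c[i];          /* assignment of T for the next iteration */
    }
}

/* After peeling: the first use and the last assignment are moved out
   of the loop, so each remaining iteration assigns its own private
   copy of T before using it. */
void peeled(int *a, const int *b, const int *c, int n, int *t)
{
    if (n <= 0) {
        return;
    }
    a[0] = *t + b[0];               /* peeled first use */
    for (int i = 1; i < n; i++) {
        int ti = c[i - 1];          /* assigned before use, private to the iteration */
        a[i] = ti + b[i];
    }
    *t = c[n - 1];                  /* peeled final assignment */
}
```

In the peeled form no iteration depends on a scalar written by another iteration, so the remaining loop is a candidate for conversion to a FORALL loop.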
Page 50: Loop Interchange
TriCore 32-bit Unified Processor Compiler Writer’s Guide Loop Interchange With multi-nested loops there is an opportunity to convert any of the loops to a FORALL loop. However the loop to be converted should be chosen on the basis that the array memory accesses are contiguous.
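A minimal sketch of loop interchange on a row-major C array. In the first version the inner loop strides through memory by a whole row; after interchange the inner loop touches contiguous elements, which is the form suited to packed loads and stores. The dimensions and function names are hypothetical.

```c
#define ROWS 3  /* hypothetical dimensions */
#define COLS 4

/* Inner loop walks down a column: consecutive accesses are COLS elements apart. */
void scale_column_inner(int m[ROWS][COLS], int k)
{
    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            m[i][j] *= k;
        }
    }
}

/* Interchanged: inner loop walks along a row, so accesses are contiguous. */
void scale_row_inner(int m[ROWS][COLS], int k)
{
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            m[i][j] *= k;
        }
    }
}
```

The interchange is legal here because each element is updated independently of all others.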
Page 51: Loop Reversal
TriCore 32-bit Unified Processor Compiler Writer’s Guide Loop Reversal Sometimes a loop cannot be converted to a FORALL loop if the loop index is decreasing; i.e. the array elements are accessed in the reverse order. This does not correspond well with the packed arithmetic array accesses.
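A minimal sketch of loop reversal, assuming the loop carries no dependence and the arrays do not overlap. The function names are hypothetical.

```c
/* Original: the index decreases, so elements are visited in reverse order. */
void add_descending(int *a, const int *b, const int *c, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        a[i] = b[i] + c[i];
    }
}

/* Reversed: the index increases, so accesses run forward through memory. */
void add_ascending(int *a, const int *b, const int *c, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Both versions compute the same result because no iteration reads a value written by another; with a loop-carried dependence the reversal would change the result and would not be allowed.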
Page 52: Reductions
This eliminates the need for epilogue code to combine the partial accumulations. Many important kernels are reductions, so it is important that the TriCore compiler can detect reduction loops and generate packed arithmetic for them.
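For reference, this is a sketch of what a strip-mined reduction looks like when the epilogue is still needed: four independent partial sums, one per byte lane, are combined after the loop. The lane count, function name and element type are assumptions made for this illustration.

```c
#include <stdint.h>

/* Sums n byte-sized elements using four partial accumulators. */
int32_t sum_four_lanes(const int8_t *a, int n)
{
    int32_t p0 = 0, p1 = 0, p2 = 0, p3 = 0;
    int i = 0;

    /* Main loop: each lane accumulates every fourth element. */
    for (; i + 4 <= n; i += 4) {
        p0 += a[i];
        p1 += a[i + 1];
        p2 += a[i + 2];
        p3 += a[i + 3];
    }

    /* Epilogue: combine the partial sums into the final result. */
    int32_t total = (p0 + p1) + (p2 + p3);

    /* Remainder: leftover elements when n is not a multiple of four. */
    for (; i < n; i++) {
        total += a[i];
    }
    return total;
}
```

Reordering the additions this way is valid for integer sums; the combining step after the main loop is the epilogue that the text refers to.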
Page 53 TriCore 32-bit Unified Processor Compiler Writer’s Guide long ST; ST = 0; for (I = 0; I < 100; I=I+4) { ST = ST + A[I:I+3:1]; extr Da, ST, #16, #16 add ST, Da extr Da, ST, #8, #8 add S, ST, Da Figure 17 Strip-mined Loop with Triplet Code for Loop of Figure 16...
Page 54: Miscellaneous Transformations
In general, a series of transformations such as those described above may need to be applied to achieve maximum performance for a nested loop, and a TriCore compiler must use packed data instructions to exploit maximum performance wherever possible.
Page 56: DSP Support
TriCore 32-bit Unified Processor Compiler Writer’s Guide DSP Support DSP support for TriCore has been approached from two sides: one is utilization of the TriCore DSP instruction set, and the other is DSP programmability in a high-level language like ANSI C. The ANSI C and C++ languages were not designed for DSP programming and do not have any SIMD constructs.
Page 57: Keywords
TriCore 32-bit Unified Processor Compiler Writer’s Guide Data types can also be extended by the addition of memory-qualifiers. General memory qualifiers are defined as __X and __Y. However TriCore does not distinguish between __X and __Y memories and they should therefore be mapped to the unified memory. The use of __X and __Y memory qualifiers makes the code portable to the next generation processors.
Page 58: Appendix A Instruction Pairs For Packed Arithmetic
TriCore 32-bit Unified Processor Compiler Writer’s Guide Appendix A: Instruction Pairs for Packed Arithmetic This appendix serves as a guide for an assembly programmer or a compiler writer, on the use and capability of TriCore's packed arithmetic instructions. Given an instruction, the following table guides a compiler writer or assembly programmer to choose the corresponding packed arithmetic instruction.
Page 59 TriCore 32-bit Unified Processor Compiler Writer’s Guide Table columns: Instruction type, Assembly Mnemonic, Corresponding Packed Arithmetic Mnemonic. 15 Multiply add -- multi-precision: MADD(S).Q, MADDM(S).H. 16 Multiply add with rounding: MADDR(S).Q, MADDR(S).H. 17 Maximum value: MAX.B, MAX.H. 18 Maximum value unsigned: MAX.U, MAX.BU, MAX.HU. 19 Minimum value: MIN.B...
Page 60: Appendix B Coding Examples
TriCore 32-bit Unified Processor Compiler Writer’s Guide Appendix B: Coding Examples This Appendix demonstrates how compactly TriCore assembly code can be written for DSP kernels in ISO DSP-C. The following examples show how TriCore uses its compact and efficient instruction set to program the DSP C kernels. #define N 64 void VectorMult(short __fixed X[N], short __fixed Y[N], short __fixed Z[N])
Page 61 TriCore 32-bit Unified Processor Compiler Writer’s Guide LC, (N/4-1) mov.u d6,#0x91ec ld.w d3,[Xptr+]4 addih d6,d6,#0x91ec ld.d e4, [Vptr+]8 preloop: maddrs.h d0,d4,d3,d6ul,#1 ld.d e2,[Xptr+]8 maddrs.h d1,d5,d2,d6ul,#1 ld.d e4,[Vptr+]8 st.d [Zptr+]8,e0 loop LC,preloop Figure A.4: TriCore assembly code of the kernel in Figure A.3
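The body of VectorMult is not reproduced in this excerpt. As a point of reference only, the sketch below shows what an element-wise fractional multiply could look like in portable C, assuming that `short __fixed` denotes a 16-bit Q15 fractional value and that the kernel computes Z[i] = X[i] * Y[i]. Both of those are assumptions, and the function names are hypothetical.

```c
#include <stdint.h>

/* Q15 multiply with rounding and saturation (assumed semantics).
   Relies on arithmetic right shift of negative values, which is
   implementation-defined in C but common in practice. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */

    p = (p + 0x4000) >> 15;                /* round to nearest, back to Q15 */
    if (p > INT16_MAX) {
        p = INT16_MAX;                     /* only -1.0 * -1.0 overflows */
    }
    if (p < INT16_MIN) {
        p = INT16_MIN;
    }
    return (int16_t)p;
}

/* Hypothetical element-wise product of two Q15 vectors. */
void vector_mult_q15(const int16_t *x, const int16_t *y, int16_t *z, int n)
{
    for (int i = 0; i < n; i++) {
        z[i] = q15_mul(x[i], y[i]);
    }
}
```

A loop of this shape has no loop-carried dependence, so it is the kind of kernel that packed half-word multiply instructions and a LOOP instruction can cover compactly.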
Page 64 ://www.infineon.com Published by Infineon Technologies AG...

Preface

1 Optimization Strategies

2 Implementation Information

3 Advanced Optimizations

4 DSP Support

Appendix A Instruction Pairs for Packed Arithmetic

Appendix B Coding Examples
