Table of Contents

1. Introduction
Document Description
What It Is
What It Is Not
Information Presentation
RSP Software Development Tools
rspasm
cpp
m4
buildtask
rsp2elf
Revision 1.0

VU Instruction Format
Distinguishing SU and VU Instructions
Illegal Instructions
Execution Pipeline
RSP Block Diagram
Mary Jo’s Rules
Register Hazards
SU is Bypassed
Coprocessor 0
Interrupts, Exceptions, and Processor Status
VU Select Instructions
Vector Select Examples
VU Logical Instructions
VU Divide Instructions
Reciprocal Table Lookup
Higher Precision Results
Vector Divide Examples
4. RSP Coprocessor 0
Register Descriptions
RSP Point of View
DMA Full
DMA Wait
DMA Addressing Bits
CPU Semaphore
DMA Examples
Controlling the RDP
How to Control the RDP Command FIFO
Examples
5. RSP Assembly Language
Different From Other MIPS Assembly Languages
List of Tables

Table 3-1 VU Load/Store Instruction Summary
Table 3-2 VU Computational Instruction Opcode Encoding
Table 3-3 VU Computational Instruction Element Encoding
Table 3-4 VU Multiply Instruction Summary
Table 3-5 VU Add Type Encoding
Table 3-6 VU Select Type Encoding
Table 3-7 VU Logical Type Encoding
Table 3-8...
Introduction The RSP (Reality Signal Processor) is a powerful processor which is part of the RCP (Reality Co-Processor), the heart of the Nintendo Ultra64. The RSP operates in parallel with the host CPU (MIPS R4300i) and dedicated graphics hardware on the RCP. Software running on the RSP (microcode) implements the graphics geometry pipeline (transformations, clipping, lighting, etc.) and audio processing (wavetable synthesis, sampled sound,...
Document Description

What It Is

The goal of this document is to enable RSP microcode software development:
• Explain architectural details of the RSP.
• Explain relevant architectural details of other parts of the RCP.
• Describe the RSP from a microcode programmer’s point-of-view.
•...
RCP operations (operating system, graphics, audio, etc.). These things are explained in other documents; a thorough background knowledge of the Ultra64 is assumed in this document.

Information Presentation

Mastery of the information presented in this document will occur slowly, as the information is both voluminous and of tremendous breadth.
• Chapter 2, “RSP Architecture,” describes the architecture of the RSP in great detail.
• Chapter 3, “Vector Unit Instructions,” explains the vector unit (VU) instructions, building on the RSP architecture and leading into RSP programming.
• Chapter 4, “RSP Coprocessor 0,” describes the RSP’s Coprocessor 0. The RSP Coprocessor 0 controls DMA activity, RDP synchronization, and host CPU interaction.
RSP Software Development Tools

A brief introduction to the RSP programming environment will provide a framework for future discussions. The following software tools are typically used for developing RSP code. This section only mentions the critical, RSP-specific tools; other, more general tools (like make and other UNIX tools) are not discussed.
The rspasm assembler outputs several special files. The root filename for these files can be specified with the -o flag.
• <rootname> is the binary executable code (text section). This file can be loaded into the RSP simulator instruction memory (IMEM) and executed.
The m4 macro processor is a useful tool that can optionally be invoked by the assembler (rspasm -m). If requested, m4 will process the source code after cpp, but before assembly. Although this is a powerful feature, it is not used to build the currently released software.
Originally developed to verify hardware design and enable parallel hardware and software development, it remains useful for developing RSP microcode in a stand-alone fashion. It has two interfaces, a simple text window interface (rsp) and a fancy window interface (rspg). The window interface supports source-level debugging, which is extremely useful.
Chapter 2

RSP Architecture

This chapter explains the significant architectural details of the Reality Signal Processor (RSP). It is not intended to be a comprehensive hardware specification, but it does describe the hardware features in sufficient detail for software development. Standing alone, the RSP is an extremely powerful processor;...
(booting, IMEM, DMEM, etc.)

Part of the RCP

Figure 2-1, reproduced from the Nintendo 64 Programming Manual, illustrates the major functional blocks of the RCP. The RSP, along with the RDP and the IO subsystem, makes up the RCP chip.
[Figure 2-1: Block Diagram of the RCP. Labeled blocks: IMEM, DMEM, RDRAM (Rambus Memory), TMEM, CPU, VI, R4300, Audio, Game Controllers, Video, Cartridge.]

R4000 Core

The RSP implements an R4000 core instruction set, with additional extensions. The core instruction unit (without the extensions) is referred to as the Scalar Unit (SU).
Clock Speed

The RSP clock runs at 62.5 MHz. Normally, the CPU and the RCP clock rates run in a 3:2 ratio.

Vector Processor

The RSP has a vector processor, implemented as MIPS Coprocessor 2. The vector unit (VU) has 32 128-bit wide vector registers (which can also be accessed as 8 vector slices), a vector accumulator (which also has 8 vector slices), and several special-purpose vector control registers.
Major R4000 Differences

The MIPS R4000 series processors provide a convenient framework for learning about the RSP.

Pipeline Depth

Pipeline depth varies among MIPS processors and their implementations. The RSP has a pipeline depth of 5.

No Interrupts, Exceptions, or Traps

The RSP operates as a slave processor.
IMEM

The RSP has 4K bytes (1K instructions) of instruction memory (IMEM).

Addressing

The RSP PC is only 12 bits wide; only the lowest 12 bits of any address or branch target are used. Other address bits are ignored.

Explicitly Managed

IMEM must be explicitly managed by the RSP program.
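The 12-bit addressing rule can be modeled in a few lines. The following Python sketch is illustrative only; the helper name `imem_address` is hypothetical and is not part of any RSP tool.

```python
IMEM_SIZE = 4 * 1024        # 4K bytes of instruction memory
IMEM_MASK = IMEM_SIZE - 1   # 0xFFF: only the lowest 12 bits are used


def imem_address(target: int) -> int:
    """Return the IMEM byte address selected by an address or branch target.

    Hypothetical helper: models "other address bits are ignored" by
    masking off everything above bit 11.
    """
    return target & IMEM_MASK


if __name__ == "__main__":
    # The upper bits of the target are dropped before IMEM is indexed.
    print(hex(imem_address(0x04001080)))
```

A consequence of this model is that a branch target just past the end of IMEM wraps back to address 0.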
DMEM

The RSP has 4K bytes of data memory (DMEM).

Addressing

Since DMEM is 4K bytes, only the lowest 12 bits of addresses are used to address DMEM. Other address bits are ignored.

Explicitly Managed Resource

DMEM must be managed by the RSP program. All RSP loads/stores can only access DMEM;...
External Memory Map

The RSP memory and control registers map into the host CPU address space as defined in the file rcp.h. This memory map is used by the CPU program to manage the RSP. It is also convenient to use this address map with the RSP assembler (rspasm) and RSP simulator (rsp).
Scalar Unit Registers

The RSP Scalar Unit has 32 general-purpose registers, each 32 bits wide.

SU Register Format

The RSP has big-endian byte ordering.

[Figure 2-2: SU Register Format, showing byte 0, byte 1, byte 2, byte 3.]

Register 0

Register 0 ($0) is a special register.
SU Control Registers

RSP control registers are part of Coprocessor 0, and are explained in Chapter 4, “RSP Coprocessor 0,” particularly Table 4-2, “RSP Status Register,” on page 85.
Vector Unit Registers

The RSP Vector Unit has 32 general-purpose vector registers, each 128 bits wide. Depending on the operation, vector registers can be accessed as a single unit, by bytes, or by 16-bit elements corresponding to a vector slice.

VU Register Format

The RSP has big-endian byte ordering.
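As a rough illustration of the element view of a vector register, the Python sketch below splits a 16-byte register image into eight signed 16-bit elements using big-endian ordering. The function name `vu_elements` is hypothetical, and the numbering of elements from the first byte upward is an assumption based on the big-endian statement above.

```python
import struct
from typing import List


def vu_elements(register_bytes: bytes) -> List[int]:
    """Split a 128-bit register image into eight signed 16-bit elements.

    Assumption: element 0 is built from byte 0 (most significant) and
    byte 1, element 1 from bytes 2 and 3, and so on.
    """
    if len(register_bytes) != 16:
        raise ValueError("a vector register is 128 bits (16 bytes) wide")
    # ">8h" means big-endian, eight signed 16-bit values.
    return list(struct.unpack(">8h", register_bytes))


if __name__ == "__main__":
    image = bytes(range(16))   # bytes 0x00, 0x01, ..., 0x0F
    print(vu_elements(image))
```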
Instructions can operate on pairs of elements, adding two vectors (8 pairs of numbers), for example. VU registers can also be addressed as scalars, allowing you to add 1 number (the same number) to a vector (8 numbers), for example. Further, registers can be broken into scalar halves or scalar quarters,...
Accumulator

Each vector slice has a 48-bit accumulator associated with it. Each 16-bit element of a vector register maps to a vector slice, and therefore to a different 48-bit accumulator.

[Figure 2-4: VU Accumulator Format. Recoverable labels: high, middle, byte 0, byte 1, byte 2, byte 3...]
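To make the 48-bit layout concrete, the following Python sketch splits one accumulator value into three 16-bit portions. The names `split_accumulator` and `low` are assumptions for illustration; the figure residue above only shows the labels "high" and "middle".

```python
from typing import Tuple


def split_accumulator(acc: int) -> Tuple[int, int, int]:
    """Split a 48-bit accumulator value into (high, middle, low) fields.

    Negative inputs are first reduced to their 48-bit two's complement
    image, so every returned field is in the range 0..0xFFFF.
    """
    acc &= (1 << 48) - 1
    high = (acc >> 32) & 0xFFFF
    middle = (acc >> 16) & 0xFFFF
    low = acc & 0xFFFF      # "low" is an assumed name for bits 15..0
    return high, middle, low


if __name__ == "__main__":
    print([hex(part) for part in split_accumulator(0x123456789ABC)])
```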
The low 8 bits are used for most compares (vlt, veq, vne, vge) and merge (vmrg), and all 16 bits are used for the clip compares (vcl, vch, vcr).

[Figure 2-5: VCC Register Format. Recoverable labels: select compare is TRUE (vs >= vt, for clip compares); vs <= -vt (for clip compares); elem...]
Vector Compare Extension Register (VCE)

This 8-bit register contains one bit for each VU slice, set to 1 if the vch comparison was -1, 0 otherwise. Expressed in a high-level language:

if ((vs[elem] < 0 && vt[elem] >= 0) || (vs[elem] >= 0 &&...
SU and VU Interaction

The RSP can execute two instructions per clock cycle, one scalar instruction and one vector instruction. The scalar unit and vector unit operate in parallel.

Dual Issue of Instructions

The instruction fetch cycle can fetch at most two instructions, one SU and one VU.
RSP Instruction Set

The details of the instruction set can be found in Appendix A; however, several important properties are worth mentioning here.

Instruction Formats

All RSP instructions are implemented within the MIPS R4000 Instruction Set Architecture.

SU Instruction Format

The SU instructions include all three formats found in the MIPS ISA: immediate (I-type), jump (J-type), and register (R-type).
Execution Pipeline

RSP Block Diagram

The RSP execution pipeline is illustrated in Figure 2-8. The scalar unit of the RSP has a five stage pipeline:

Instruction Fetch. During this stage, two instructions are fetched and decoded, dual-issuing, if possible.

Register Access and Instruction Decode.
Mary Jo’s Rules

Pipeline stalls can be avoided in software by understanding the following rules.

VU register destination writes 4 cycles later (need 3 cycles between load and use). This applies to vector computational instructions, vector loads, and coprocessor 2 moves (mtc2).
Obviously, pipeline stalls should be avoided by the programmer (when possible) for the best performance. Because the SU is bypassed (see below), this section only applies to SU registers for loads (and coprocessor moves) and VU registers.

SU is Bypassed

Bypassing, or forwarding,...
Coprocessor 0

The RSP coprocessor 0 is thoroughly discussed in Chapter 4, but is mentioned here for completeness. Coprocessor 0 in the MIPS R4000 architecture is designated as the “system control coprocessor”. Since the RSP is a slave processor, the system control functions are greatly reduced, and therefore the usage of coprocessor 0 does not conform to the MIPS R4000 architecture specification.
Interrupts, Exceptions, and Processor Status

Interrupts

The RSP does not respond to interrupts, and it can only generate a single interrupt (MI_INTR_SP), triggered by the break instruction.

Exceptions

No RSP instruction can cause an exception, and there are no exception handling facilities in the RSP.
Chapter 3

Vector Unit Instructions

Details about each specific instruction are contained in Appendix A, but it is useful to discuss issues common to all of the vector unit instructions, as well as to discuss each related group of vector unit instructions in context. There are two categories of vector unit instructions discussed in this chapter:
•...
VU Loads and Stores

Vector loads and stores are scalar unit (SU) instructions used to move the contents of DMEM to and from VU registers (see “VU Register Format” on page 34). VU loads and stores can only access DMEM; they cannot access DRAM.
register of a VU load, hardware interlocking will stall the processor until the data arrives.

Note: VU stores use an identical pipeline; since accesses to memory always occur in the same VU pipeline stage, a VU store followed by an immediate load from the same memory location is guaranteed to fetch the correct data.
Memory Item                              Memory Alignment   VU Element (legal values)   Offset Shift Amount   Opcode
4 8b every 4th, unsigned (fourth pack)   quad+0 to 3        0, 8                        << 4                  lfv, sfv
8 16b (transpose, wrap)                  quad               0-14 by 2                   << 4                  ltv, stv,

If an illegal alignment (or element value) is attempted, something will...
[Figure 3-2: Long, Quad, and Rest Loads and Stores. Panels include a long item and a quad item crossing a memory word; each panel is labeled with byte address, 128b alignment, item size, memory word, VU register, and element...]
Packed

Packed loads and stores move memory bytes to or from short elements of the VU register, which are aligned to shorts. They are useful for accessing one, two, or four channel byte image data for VU processing as shorts, such as for VU multiplies.
[Figure 3-3: Packed Loads and Stores. Panels: Half; Fourth; Pack, Unsigned Pack. Each panel is labeled with 128b alignment, byte address, memory word, VU register, and element...]
The alignment of various pack formats with VU short elements is shown in Figure 3-4.

[Figure 3-4: Packed Load and Store Alignment. Rows: Pack; Upack, Half, Fourth. Labels: memory byte item, VU short element, zero.]

Unsigned pack, half, and fourth items are intended to support unsigned bytes for one, two, or four channel image data.
dest_short[ Slice ] = source_short[((Slice + (Element >> 1)) & 0x7)]

A transpose is shown in Figure 3-5, with an 8x8 block of shorts held in 8 VU registers, numbered in row order for the 64 elements of the block. The other 14 vector loads and stores needed for the transpose are similar.
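The slice formula above can be evaluated directly. The Python sketch below is only a model of that one expression; the function name `rotated_slices` is hypothetical, and the sketch does not model which registers or memory locations a particular load or store touches.

```python
from typing import List


def rotated_slices(source: List[int], element: int) -> List[int]:
    """Apply dest[slice] = source[(slice + (element >> 1)) & 0x7]."""
    if len(source) != 8:
        raise ValueError("expected eight 16-bit elements")
    shift = element >> 1
    return [source[(s + shift) & 0x7] for s in range(8)]


if __name__ == "__main__":
    row = [10, 11, 12, 13, 14, 15, 16, 17]
    print(rotated_slices(row, 0))   # element 0: slices are not rotated
    print(rotated_slices(row, 4))   # element 4: rotated by two slices
```

Since Element is a byte offset into the register, even values 0 through 14 give the eight possible rotations.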
Vector Unit Instructions VU Register Moves VU register move instructions follow the general format of MIPS Coprocessor moves (MTC2, MFC2, CTC2, CFC2), with additional interpretation of the lower 11 bits. VU Coprocessor Moves Figure 3-6 COP2 move opcode undefined element The low 16 bits of the SU register are moved from or to the 16 bit element element...
VU Computational Instructions

The VU computational instructions adhere to the general format of MIPS Coprocessor Operate instructions (COP2).

[Figure 3-7: VU Computational Instruction Format. Fields: COP2, element, opcode.]

Most VU computational instructions are three operand:

VD = VS operation VT

where each operand is one of 32 vector registers.
Using Scalar Elements of a Vector Register

Element encodings are shown in Table 3-3, where x indicates the bit field used to select which element. Scalar elements can be selected within quarters, halves, or the whole vector.

Table 3-3 VU Computational Instruction Element Encoding

Assembly Element...
point-pair in the same half of the vector registers. The register contents and operations are illustrated in Figure 3-8.

[Figure 3-8: Scalar Half and Scalar Quarter Vector Register Elements. Recoverable content: vsub $v3, $v1, $v2 with elements (xa-xb) (ya-yb) (za-zb) (xa-xb) (ya-yb)]
In the above example (since add is commutative), a slightly different usage of the vector registers could have been used to direct the final result to be in a different element. Replacing:

        vadd $v3, $v3, $v3[1q]

with

        vadd $v3, $v3, $v3[0q]

would leave the final result in element [1h] instead of [0h].
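One way to reason about the [nq] and [nh] forms is to list, for each of the eight slices, which element of the second source register it reads. The Python sketch below is a model under stated assumptions: the function name `source_lanes` is hypothetical, and the mapping (quarters select within pairs, halves within groups of four, a plain index broadcasts one element) is a reading of the text, not a reproduction of the encoding table.

```python
from typing import List


def source_lanes(spec: str) -> List[int]:
    """Return the VT element index read by each of the 8 slices.

    Assumed forms: "" (whole vector), "nq" with n in 0..1 (scalar
    quarter), "nh" with n in 0..3 (scalar half), "n" with n in 0..7.
    """
    if spec == "":
        return list(range(8))
    if spec.endswith("q"):
        n, group = int(spec[:-1]), 2
    elif spec.endswith("h"):
        n, group = int(spec[:-1]), 4
    else:
        n, group = int(spec), 8
    if not 0 <= n < group:
        raise ValueError("element index out of range for " + repr(spec))
    # Each slice reads element n of the group that the slice belongs to.
    return [(i // group) * group + n for i in range(8)]


if __name__ == "__main__":
    for spec in ("", "0q", "1q", "0h", "3"):
        print(repr(spec), source_lanes(spec))
```

Under this model, with [1q] slice 0 reads element 1, and with [0q] slice 1 reads element 0, which matches the observation that swapping the two forms moves the sum from one element of the pair to the other.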
Revision 1.0 VU Multiply Instructions VU Multiply Instructions VU Multiply Opcode Encoding Figure 3-9 format VU multiply instructions perform various multiplies, specified by the following fields: Element: Vector or scalar element of When == 1, Accumulate the product, otherwise round the product and load the accumulator.
Vector Unit Instructions Prod S, T signed Round Value Result Clamping Instructions Shift 1 1 0 uns, sign b15-0 sign, b31-msb vmudn, vmadn 1 1 1 sign, sign << 16 b31-16 sign, b31-msb vmudh, vmadh vmulf and vmulu support operands with 15 fraction bits, and differ only in whether the result is clamped signed or unsigned.
Rounding is performed for single precision multiplies by adding the appropriate rounding value (as dictated by the format) to the accumulator. Clamping (saturation) is performed by testing certain accumulator bits above the 16 bit result field, and substituting maximum or minimum 16 bit signed or unsigned numbers, as dictated by the format.
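The rounding and clamping steps can be sketched in Python as follows. This is a simplified arithmetic model, not the hardware's bit tests. The doubling of the product and the 0x8000 rounding constant in `fraction_multiply` are assumptions chosen to illustrate a signed multiply of operands with 15 fraction bits; all three function names are hypothetical.

```python
def clamp_signed16(value: int) -> int:
    """Saturate to the signed 16-bit range -32768..32767."""
    return max(-0x8000, min(0x7FFF, value))


def clamp_unsigned16(value: int) -> int:
    """Saturate to the unsigned 16-bit range 0..65535."""
    return max(0, min(0xFFFF, value))


def fraction_multiply(s: int, t: int) -> int:
    """Multiply two signed operands that each carry 15 fraction bits.

    Assumed steps: double the product so the result again has 15
    fraction bits in bits 31..16, add a rounding value of 0x8000,
    take bits 31..16, then clamp to a signed 16-bit result.
    """
    acc = ((s * t) << 1) + 0x8000
    return clamp_signed16(acc >> 16)


if __name__ == "__main__":
    print(hex(fraction_multiply(0x4000, 0x4000)))    # 0.5 * 0.5
    print(hex(fraction_multiply(-0x8000, -0x8000)))  # (-1.0) * (-1.0) saturates
```

The second call shows why clamping is needed: the exact answer, +1.0, is not representable with 15 fraction bits in a signed 16-bit result.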
Double precision operands use a register pair, one register containing the upper signed 16 bits and another containing the low unsigned 16 bits. Double precision multiplication is illustrated in Figure 3-10.

[Figure 3-10: Double-precision VU Multiply. VS and VT operands: high 16b signed int, low 16b unsigned frac. First step shown: vmudl SL * TL >>...]
Vector Multiply Examples

The following code fragments illustrate various multiplies. In this section, the following notation is used:
• I is a signed 16-bit integer.
• F is an unsigned 16-bit fraction.
• IF is a 32-bit number, with the signed upper 16 bits contained in one register, and the unsigned lower 16 bits contained in a second register.
        vmadm   res_int, s_int, t_frac
        vmadn   res_frac, dev_null, dev_null[0]

IxI:    # single precision integer multiply:
        # I * I = I
        vmudh   res_int, s_int, t_int

IxF:    # single precision multiply:
        # I * F = IF
        vmudm   res_int, s_int, t_frac
        vmadn   res_frac, dev_null, dev_null[0]

Other combinations are left as an exercise to the reader.
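The arithmetic behind an IF * IF multiply can be checked with a short Python model. It splits each 32-bit operand into a signed integer half and an unsigned fraction half, then sums four partial products. The function names are hypothetical, and the sketch models only the arithmetic: it does not truncate the result to 32 bits, and it does not model the accumulator or clamping.

```python
from typing import Tuple


def split_if(value: int) -> Tuple[int, int]:
    """Split a signed IF value into (signed high 16 bits, unsigned low 16 bits)."""
    return value >> 16, value & 0xFFFF


def multiply_if(s: int, t: int) -> int:
    """Multiply two IF values using four 16x16 partial products."""
    sh, sl = split_if(s)
    th, tl = split_if(t)
    low_low = (sl * tl) >> 16      # fraction * fraction, keeps only the upper part
    mixed = sh * tl + sl * th      # integer * fraction terms
    high_high = (sh * th) << 16    # integer * integer term
    return low_low + mixed + high_high


if __name__ == "__main__":
    a = 0x00018000   # 1.5 with 16 fraction bits
    b = 0x00028000   # 2.5 with 16 fraction bits
    print(hex(multiply_if(a, b)))
```

Because the high halves are signed and the low halves are unsigned, the sum of the four terms equals the full product shifted right by 16 with the low bits discarded.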
VU Add Instructions

[Figure 3-11: VU Add Opcode Encoding. Fields include type.]

The VU add instructions perform various types of adds, specified by the following fields:

Element: Vector or scalar element of VT (except vsar where it selects the accumulator portion).
Type       Instruction
1 1 0 1    vsar
1 1 1 0    reserved
1 1 1 1    reserved

The VU adds are short (16 bit) add operations; they clear VCO and clamp to 16 bit signed values. vadd uses VCO as carry in, vsub uses VCO as borrow in, and vabs ignores VCO:

vadd: VD = VS + VT
...
• I is a signed 16-bit integer.
• F is an unsigned 16-bit fraction.
• IF is a 32-bit number, with the signed upper 16 bits contained in one register, and the unsigned lower 16 bits contained in a second register.
VU Select Instructions

The VU select operations compare pairs of vector elements and choose which one to write, based on the outcome of the test.

[Figure 3-12: VU Select Opcode Encoding. Fields: 1 0 0, type.]

Instruction fields are:

Element: Vector or scalar element of VT
Type...
vne:    VS != VT
vge:    VS >= VT
vch:    Clip test, single precision or high half of double precision.
vcl:    Clip test, low half of double precision.
vcr:    1’s complement clamp.
vmrg:   VD = VS or VT selected by VCC, VCO is ignored.

To implement comparisons which are not supplied, the ‘vle’...
Note that vmrg uses the low 8 bits of VCC; the upper 8 bits, as set by vcl/vcr, are ignored. The results of a compare in VCC are available to a following vmrg instruction using VCC without pipeline delays. VCC can also be accessed by the SU with VU move instructions (ctc2/cfc2) for other processing such as accumulation, branching, or patterning.
Note: For single precision vch not followed by a vcl, VCO must be set before another compare (by a move, add, or compare whose results are not meaningful).

The vcr instruction is similar to vcl, except that is a 1’s complement instead of 2’s complement number, such as for clamping to a power of 2.
Vector Unit Instructions VU Logical Instructions The VU logical instructions perform the usual bit-wise logical operations on writing the result to Figure 3-13 VU Logical Opcode Encoding 1 0 1 type Instruction fields are: Element : Vector or scalar element of Type : One of the following operations: Table 3-7 VU Logical Type Encoding...
Revision 1.0 VU Divide Instructions VU Divide Instructions The VU divide instructions compute the reciprocal of a scalar element of a vector register. Figure 3-14 VU Divide Opcode Encoding 1 1 0 type The divide instructions are two operand, . An element specification must be provided for each operand, selecting the source and destination elements, for example: vmov $v1[5], $v2[0]...
Vector Unit Instructions The reciprocal (rcp) or reciprocal of the square root (rsq) of the scalar element of is computed by table lookup and written to the scalar element The scalar element of is selected by the register number (0-7). Not the contents of , but the instruction field bits.
Revision 1.0 VU Divide Instructions Type vt[element] vd[vs] lookup source and previous, write result vrcpl, vrsql Reciprocal Table Lookup The results are computed by a table lookup using 10 bits of precision. The input is shifted up to remove leading 0’s (or 1’s) (actually, the first non-leading digit is also removed, since we know what it is) and the next 10 bits are used to index into the reciprocal table.
Vector Unit Instructions so we need to also take the sqrt of the exponent: result ------------------ - -- - so the result does have the same radix point as the input. Higher Precision Results Algorithms which require higher precision can perform Newton-Raphson iteration on the result, such as: R’...
• _frac is a named vector register holding an unsigned 16 bit fraction.
• dev_null is a named vector register containing all zeros.

A single precision reciprocal:

        vrcp    sres_frac[0], s_int[0]
        vrcph   sres_int[0], dev_null[0]

A double precision reciprocal:

        vrcph   sres_int[0], s_int[0]
        vrcpl...
Chapter 4

RSP Coprocessor 0

This chapter describes the RSP Coprocessor 0, or system control coprocessor. The RSP Coprocessor 0 does not perform the same functions or have the same registers as the R4000-series Coprocessor 0. In the RSP, Coprocessor 0 is used to control the DMA (Direct Memory Access) engine, RSP status, RDP status, and RDP I/O.
Register Descriptions

RSP Point of View

RSP Coprocessor 0 registers are programmed using the mtc0 and mfc0 instructions, which move data between the SU general purpose registers and the coprocessor 0 registers.

Table 4-1 RSP Coprocessor 0 Registers

Register Name Defined in Access...
This register holds the RSP IMEM or DMEM address for a DMA transfer.

[Register layout: bit a selects the memory (a=0: DMEM, a=1: IMEM); the remaining field holds the IMEM or DMEM address.]

On power-up, this register is 0x0.

This register holds the DRAM address for a DMA transfer. This is a physical memory address.
The three fields of this register are used to encode arbitrary transfers of rectangular areas of DRAM to/from contiguous I/DMEM. length is the number of bytes per line to transfer, count is the number of lines, and skip is the line stride, or skip value between lines.
This register holds the RSP status.

Table 4-2 RSP Status Register

Table columns: field, Access Mode, Description.

RSP is halted.
RSP has encountered a break instruction.
DMA is busy.
DMA is full.
IO is full.
RSP is in single-step mode.
Interrupt on break.
The ‘broke’, ‘single-step’, and ‘interrupt on break’ bits are used by the debugger. The signal bits can be used for user-defined synchronization between the CPU and the RSP.

On power-up, this register contains 0x0001.

When writing the RSP status register, the following bits are used.

Table 4-3 RSP Status Write Bits

Description
clear HALT.
Description
set SIGNAL 0. (0x00000400)
clear SIGNAL 1. (0x00000800)
set SIGNAL 1. (0x00001000)
clear SIGNAL 2. (0x00002000)
set SIGNAL 2. (0x00004000)
clear SIGNAL 3. (0x00008000)
set SIGNAL 3. (0x00010000)
clear SIGNAL 4. (0x00020000)
set SIGNAL 4. (0x00040000)
clear SIGNAL 5.
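The signal write-bit values listed in Table 4-3 follow a regular pattern, which the Python sketch below captures. The formulas are inferred from the listed values only, and the function names are hypothetical; values for entries that are not listed above should be checked against the full table or rcp.h rather than taken from this sketch.

```python
def set_signal_bit(n: int) -> int:
    """Write-bit mask for "set SIGNAL n", inferred from the listed values."""
    if n < 0:
        raise ValueError("signal number must be non-negative")
    return 1 << (10 + 2 * n)


def clear_signal_bit(n: int) -> int:
    """Write-bit mask for "clear SIGNAL n", inferred from the listed values."""
    if n < 0:
        raise ValueError("signal number must be non-negative")
    return 1 << (9 + 2 * n)


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, hex(clear_signal_bit(n)), hex(set_signal_bit(n)))
```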
This register maps to bit 3 of the RSP status register, DMA_FULL. It is read only.

On power-up, this register is 0x0.

This register maps to bit 2 of the RSP status register, DMA_BUSY. It is read only.
as either a 24 bit physical DRAM address, or a 12 bit DMEM address (see $c11).

[Figure: RDP Command Start]

On power-up, this register is undefined.

This register holds the RDP command buffer END address. Depending on the state of the RDP STATUS register, this address is interpreted by the RDP as either a 24 bit physical DRAM address, or a 12 bit DMEM address (see $c11).
register, this address is interpreted by the RDP as either a 24 bit physical DRAM address, or a 12 bit DMEM address (see $c11).

[Figure: RDP Command Current]

On power-up, this register is 0x0.

$c11

This register holds the RDP status.

Table 4-4 RDP Status Register

Access field...
Table columns: field, Access Mode, Description.

RDP COMMAND buffer is ready.
RDP DMA is busy.
RDP COMMAND END register is valid.
RDP COMMAND START register is valid.

When bit 0 (XBUS_DMEM_DMA) is set, the RDP command buffer will receive data from DMEM (see $c8, $c9, $c10).
Description
clear PIPE COUNTER. (0x0080)
clear COMMAND COUNTER. (0x0100)
clear CLOCK COUNTER. (0x0200)

$c12

This register holds a clock counter, incremented on each cycle of the RDP clock. This register is READ ONLY.

[Figure: RDP Clock Counter]

On power-up, this register is undefined.

$c13

This register holds an RDP command buffer busy counter, incremented on each cycle of the RDP clock while the RDP command buffer is busy.
$c14

This register holds an RDP pipe busy counter, incremented on each cycle of the RDP clock that the RDP pipeline is busy. This register is READ ONLY.

[Figure: RDP Pipe Busy Counter]

On power-up, this register is undefined.

$c15

This register holds an RDP TMEM load counter, incremented on each cycle of the RDP clock while the TMEM is loading.
Bit patterns for READ and WRITE access are the same as described in the previous section.

Table 4-6 RSP Coprocessor 0 Registers (CPU VIEW)

Table columns: Register Number, Access Mode, Address, Description.

0x04040000    I/DMEM address for DMA.
0x04040004    DRAM address for DMA.
...
All data operated on by the RSP must first be DMA’d into DMEM. RSP programs can also use DMA to load microcode into IMEM.

Note: Loading microcode on top of the currently executing code at the PC will result in undefined behavior.
Addressing Bits

Since all DMA accesses must be 64-bit aligned, the lower three bits of source and destination addresses are ignored and assumed to be all 0’s. Transfer lengths are encoded as (length - 1), so the lower three bits of the length are ignored and assumed to be all 1’s.
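The effect of the ignored bits can be worked through with a small Python model. The function names are hypothetical, and the sketch covers only the alignment and (length - 1) rules stated above, not the count and skip fields.

```python
def dma_effective_address(addr: int) -> int:
    """Address actually used: the lower three bits are treated as 0."""
    return addr & ~0x7


def dma_length_field(nbytes: int) -> int:
    """Value to program for a transfer of nbytes: encoded as (length - 1)."""
    if nbytes < 1:
        raise ValueError("transfer length must be at least 1 byte")
    return nbytes - 1


def dma_bytes_transferred(length_field: int) -> int:
    """Bytes actually moved: the lower three bits are treated as all 1's."""
    return (length_field | 0x7) + 1


if __name__ == "__main__":
    for nbytes in (1, 8, 9, 64):
        field = dma_length_field(nbytes)
        print(nbytes, hex(field), dma_bytes_transferred(field))
```

In this model every request is rounded up to a multiple of 8 bytes, so a buffer that receives a transfer should be sized for the rounded length.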
Controlling the RDP

The RDP has an independent DMA engine which reads commands from DMEM or DRAM into the command buffer. The RDP command buffer registers are programmed to tell the RDP where to read the command data from.
Examples

The XBUS is a direct memory path between the RSP (and DMEM) and the RDP. This example uses a portion of DMEM as a circular FIFO to send data to the RDP. This example uses an “open” and “close” interface; the “open” reserves space in the circular buffer, then the data is written, the “close”...
Figure 4-5 OutputOpen Function Using the XBUS

.name dmemp,
.name dramp,
.name outsz, $18        # caller sets to max size of write

# open(size) - wait for size avail in ring buffer.
#            - possibly handle wrap
#            - wait for ‘current’ to get out of the way
        .ent OutputOpen...
After calling OutputOpen, the program writes the RDP commands to DMEM, advancing outp. Once the complete RDP command is written to DMEM, OutputClose is called.

Figure 4-6 OutputClose Function Using the XBUS

####################################################
# OutputClose
####################################################
        .ent OutputClose...
Chapter 5

RSP Assembly Language

This chapter describes the RSP Assembly Language, as accepted by the rspasm assembler. Although different in many fundamental ways, there are some similarities with the MIPS assembly language, described in the document “MIPSPro Assembly Language Programmer’s Guide”...
Different From Other MIPS Assembly Languages

Why?

Although the RSP uses the R4000 architecture, it is a specialized processor designed for a special purpose. The assembly language is similarly restricted, and does not require the full richness of the MIPS Assembly Language.
Syntax

Tokens

The assembler has these tokens:
• identifiers
• constants
• operators

The assembler lets you put whitespace (blank characters, tabs, or newlines) anywhere between tokens. Whitespace must separate adjacent identifiers or constants that are not otherwise separated (by an expression operator, for instance).
• Hexadecimal constants, which consist of the characters 0x (or 0X) followed by a sequence of hexadecimal digits [0123456789abcdefABCDEF]*.
• Octal constants, which consist of a leading zero followed by a sequence of octal digits [01234567]*.
• String constants, which consist of any sequence of alphanumeric characters (except double quotes) enclosed in double quotes.
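The numeric constant forms can be illustrated with a small Python classifier. This is a sketch, not the rspasm lexer: the function name `parse_constant` is hypothetical, the decimal rule (a nonzero digit followed by digits) is an assumption because that rule is not shown in this excerpt, and the sketch requires at least one digit after 0x.

```python
import re

_HEX = re.compile(r"0[xX]([0-9a-fA-F]+)")
_OCTAL = re.compile(r"0[0-7]*")
_DECIMAL = re.compile(r"[1-9][0-9]*")   # assumed rule for decimal constants


def parse_constant(token: str) -> int:
    """Return the integer value of a hexadecimal, octal, or decimal token."""
    match = _HEX.fullmatch(token)
    if match:
        return int(match.group(1), 16)
    if _OCTAL.fullmatch(token):
        return int(token, 8)      # a leading zero marks an octal constant
    if _DECIMAL.fullmatch(token):
        return int(token, 10)
    raise ValueError("not a numeric constant: " + repr(token))


if __name__ == "__main__":
    for token in ("0x1F", "017", "42", "0"):
        print(token, parse_constant(token))
```

A token such as 08 is rejected by this sketch, since 8 is not an octal digit.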
• ; comments. Anything from the ‘;’ to the end of the line is ignored.

Program Sections

An RSP program has only two sections, a text section (.text) and a data section (.data). The text section is assembled in sequence, with only one base address for assembly (see .text directive).
If the assembly source code is passed through another program (such as a macro pre-processor), additional reserved keywords may be implied, if they are reserved by that program.

Expressions

An expression is a sequence of symbols that represents a value. All assembler expressions evaluate to an integer data type.
Revision 1.0 Syntax Table 5-1 Expression Operators Operator Meaning Minus (unary) Plus (unary) Precedence Expressions can be grouped with parentheses (recommended) or you can rely on the following precedence rules: Table 5-2 Expression Operator Precedence least binding, lowest precedence: binary binary *,/,%,<<,>>,^,&,| unary +,-,~ most binding, highest precedence...
expression to a temporary identifier using the .symbol directive, then use this temporary identifier by itself to initialize a data directive.

Throughout this document, expressions that cannot contain identifiers are referred to as iexpressions (integer expressions).

Registers

The syntax for referring to the scalar unit (SU) registers is a dollar sign ($), followed by an integer in the range of 0...31.
Revision 1.0 Syntax Vector Register Element Syntax In some circumstances, a scalar element of a vector register may be specified. These circumstances include the target register of most vector computational instructions and the source/destination register of all vector loads, stores, and moves. For vector computational instructions, a vector register element syntax is one of: •...
Assembly Directives

Directives, or ‘pseudo-opcodes’, are instructions to the assembler that are interpreted at compile time. They do not generate executable machine instructions. They exist to initialize data, direct the compilation, provide error checking, etc. A directive is a period (.) followed by a sequence of lowercase alphabetic characters.
.byte

.byte iexpression

One byte of the data section is allocated and initialized to the value of the iexpression. Since one byte is not sufficient to hold the address of any symbol in DMEM or IMEM, an identifier is not permitted.
RSP Assembly Language .end .end identifier [, expression] End a procedure. The assembler outputs debugging information for the debugger, including the beginning and ending locations of procedures. .ent .ent identifier [, expression] Begin a procedure. The assembler outputs debugging information for the debugger, including the beginning and ending locations of procedures.
.print

.print string-constant [, expression][, expression]...

The quoted string constant is printed to stderr during assembly. The string constant may contain C-like numeric printf conversions (%d, %x, etc.) and the expressions will be evaluated and printed to stderr. A maximum of four expressions are permitted per .print directive.
Switch to the text section. All program instructions must be contained in the text section. If the optional expression is present, it is evaluated and used as the base address for assembling the program. Only the least significant 12 bits of the base address are used, since IMEM is only 4K bytes.
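The 12-bit truncation of the base address amounts to a simple mask. The following is a minimal Python sketch of that arithmetic only; text_base is a hypothetical name and the sample address is arbitrary.

```python
IMEM_SIZE = 0x1000  # IMEM is 4K bytes, so 12 address bits are significant

def text_base(expression_value: int) -> int:
    """Keep only the least significant 12 bits of a .text base address."""
    return expression_value & (IMEM_SIZE - 1)

# An arbitrary example value: the upper bits are discarded.
example = text_base(0x04001080)
```

With the value above, only 0x080 survives as the assembly base address.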
BNF Specification of the RSP Assembly Language

This section presents a formal specification of the RSP assembly language using a Backus-Naur Form (BNF). Comments are not shown because they are removed by the parser during token scanning. ...
Chapter 6

Advanced Information

This chapter expands on some advanced topics, such as DMEM usage, RSP performance, code overlays, and the CPU-RSP relationship. Examples and information presented in this chapter are often one of many possible approaches; the reader is encouraged to treat this chapter as inspiration, not rigorous instruction.
Advanced Information DMEM Organization and Usage Planning the layout of DMEM is an essential step of writing an RSP program. A convenient DMEM layout can save precious instructions and lead to a more optimized and bug-free program. There are typically parts of DMEM which can be or need to be allocated and initialized at compile-time;...
Revision 1.0 DMEM Organization and Usage It can be convenient to reserve a VU register to hold an entire vector of constants, available for use in vector computational instructions. Labels in DMEM Labels can be used in the data section to later reference offsets for the purposes of loading or storing things.
Performance Tips

Assembly language optimizations or vector processing tricks are beyond the scope of this document; however, it is worthwhile to mention a few issues specifically relating to the RSP architecture.

Dual Execution

The RSP executes up to one Scalar Unit (SU) instruction and one Vector Unit (VU) instruction per clock cycle;...
Programming constructs like for loops:

for (i=0; i<n; i++) {}

perform the same thing on a bunch of data. This is exactly a “vector” operation. Conversely, programming constructs which separate data (switch(), if()), performing different tasks in different data situations, do not vectorize well.
Advanced Information there are, and this number is not variable. (2) we have severe code space constraints. Abstracting the vector unit size has severe implications on the vector code start-up. The point of this discussion is to observe that the hardware architecture is clearly visible in the microcode.
vadd $v1, $v2, $v3
vadd $v4, $v4, $v1

In this example, the second vadd instruction could not execute until the first vadd has completed and written back its result. There is a data dependency on register $v1. The result will be a pipeline stall that will effectively serialize the vector code, seriously dampening its performance.
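A simple way to spot this kind of hazard while reviewing code is to check whether an instruction reads the register written by the instruction just before it. The Python sketch below does only that; VecOp and raw_dependencies are hypothetical names, only adjacent instructions are examined, and the real pipeline stall rules are not modeled.

```python
from typing import List, NamedTuple, Tuple

class VecOp(NamedTuple):
    name: str               # mnemonic, e.g. "vadd"
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers

def raw_dependencies(ops: List[VecOp]) -> List[Tuple[int, int]]:
    """Return (i, i+1) pairs where instruction i+1 reads what instruction i writes."""
    return [(i, i + 1)
            for i in range(len(ops) - 1)
            if ops[i].dest in ops[i + 1].srcs]

# The two-instruction example from the text: the second vadd reads $v1.
program = [
    VecOp("vadd", "$v1", ("$v2", "$v3")),
    VecOp("vadd", "$v4", ("$v4", "$v1")),
]
hazards = raw_dependencies(program)
```

For the example above the check reports one dependent pair, instructions 0 and 1; placing an independent instruction between them would make the list empty.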
Advanced Information In this fictitious example, we have theoretically improved our program’s speed by (num_pts - 4)*(time to do the translation). A big improvement! This technique is common to help vectorizing compilers “recognize” loops that can be vectorized. The compiler will actually break up the loop into multiple vector operations the size of the number of vector elements.
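Breaking a loop into vector-sized pieces, as described above, can be sketched as follows. This is an illustrative Python model, not RSP microcode; strip_mine and translate are hypothetical names, and the chunk width of 8 reflects the eight elements of a VU register.

```python
VU_ELEMENTS = 8  # a VU register holds eight elements

def strip_mine(n, width=VU_ELEMENTS):
    """Split the index range [0, n) into chunks of at most `width` elements."""
    return [(start, min(start + width, n)) for start in range(0, n, width)]

def translate(values, delta):
    """Add delta to every value, one vector-sized chunk at a time."""
    out = []
    for lo, hi in strip_mine(len(values)):
        # Each chunk stands in for a single vector operation.
        out.extend(v + delta for v in values[lo:hi])
    return out
```

A loop over 20 points becomes three chunks, two full ones and a partial one of four elements; microcode usually has to handle (or pad) that final partial chunk explicitly.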
code which decides which attributes are necessary, we always compute them all and only output the ones we are interested in. This approach also saves precious IMEM space.

Profiling RSP Code

The RSP simulator can help profile your code; it can show pipeline stalls, load delays, and DMA wait states.
Figure 6-1 Real-time Clock Watching on the RSP

In the RSP microcode:

# Checkpoint the clock before the critical section:
mfc0 $1, $c12
sw $1, 0($0)

(Perform the critical section)

# Checkpoint the clock after the critical section:
mfc0 $1, $c12
lw $2, 0($0)
sub $1, $1, $2...
Microcode Overlays

One of the challenges of RSP programming is working within the limited instruction memory. IMEM is an explicitly managed resource; you are free to load new code as you see fit. RSP microcode loading can be divided into two situations: a swap, initiated by the host CPU, which loads the entire IMEM while the RSP is halted, and an overlay...
Advanced Information RSP Assembler Tricks The RSP assembler rspasm has several features designed to assist developing microcode overlays. IMEM Alignment Alignment directives like .bound and .align can be used in the text section to ensure that overlay destinations are 64-bit aligned, as required by the DMA engine. DMEM Initialization Initialization directives like .word and .half can be used to create a table of information necessary to perform...
Figure 6-2 buildtask Operation
[Diagram: the Output Object Text Section holds the ucode objects; the Output Object Data Section holds the ucode data and, at the -d offset, a table of offset, size, and dest entries (0, 1, 2, ...), one per object.]
Advanced Information With this information, a DMA transaction can be programmed to load an overlay into IMEM. Overlay Example To see exactly how this works, let’s examine the source code and Makefile for a simple example. Overlay Makefile ####################################################### # use the RSP linker ‘buildtask’ to construct the tasks # from the objects.
Revision 1.0 Microcode Overlays notice the usage of the -S flag used when compiling newt.u in order to access the external symbols of gspLine3D.u. The -f argument passed to buildtask prevents concatenation of the newt.dat section; this data section is redundant (any static data needed for newt.u is planned for and included in gspLine3D.u).
Revision 1.0 Microcode Overlays Overlay Decision Code Deciding when to perform an overlay is specific to each program and overlay function and therefore an example is not necessary. In this case, we always perform the overlay, since we are loading it over the RSP boot microcode (reclaiming precious IMEM space!) Overlay DMA Code Actually overlaying the new microcode is the same as any other DMA...
Advanced Information Controlling the RSP from the CPU The operating system running on the CPU includes facilities to control the RSP. The major function calls and some RSP details are explained in this section. Starting RSP Tasks The man page for osSpTaskStart() explains the CPU-side details of managing the RSP.
Hidden OS Functions

There are undocumented OS functions to access the RSP from the CPU. These functions should not be used in the regular course of game programming; their use may interfere with other core OS functionality. They can be useful for RSP program development, particularly post-mortem analysis of RSP state.
Advanced Information __osSpRawWriteIo() __osSpRawWriteIo(u32 devAddr, u32 data) Perform a 32-bit programmed IO write to RSP memory address space. Note that devAddr must be 32-bit aligned. If the interface is busy, return a -1 and abort the operation. __osSpGetStatus() __osSpGetStatus(void) Return the RSP status register. __osSpSetStatus() void __osSpSetStatus(u32 data)
Revision 1.0 Microcode Debugging Tips Microcode Debugging Tips There are two different environments for debugging microcode: (1) the RSP simulator (rsp or rspg) and (2) the coprocessor view of Gameshop (gvd). Each tool has its advantages; Gameshop is discussed in separate documentation.
Advanced Information guDumpGbiDL() This library function can be called directly from the game to dump the necessary pieces back out to the Indy. It uses the rmonPrintf() and creates a (potentially very large) ASCII file that can be read by gbi2mem. guDumpGbiDL() works by saving the OSTask structure, the microcode, the display list, and traversing the display list following any data (textures, matrices, vertices, etc.)
RSP Yielding

One of the more complex issues of synchronization between the CPU and the RSP is the concept of yielding. The motivation for yielding is discussed at length in higher-level documentation; some of the implementation details are discussed here.
Advanced Information Requesting a Yield An application requests an RSP task to yield by calling osSpTaskYield(). This function sets the Coprocessor 0 Status Register bit SP_SET_YIELD, which is #define’d as SIG0 in rcp.h. Checking for Yield The microcode checks periodically for a yield request. It would be inefficient to check too often, but it would also be dangerous to not check often enough, possibly detecting the yield too late.
Revision 1.0 RSP Yielding Saving a Yielded Process After requesting a yield, the host CPU must wait for the RSP task to finish and verify that it actually yielded. It might also modify internal state, so that the yielded task can be restarted. Restarting a Yield Process Restarting a previously yielded task is conceptually simple;...
Appendix A RSP Instruction Set Details This appendix describes the machine-language format of the RSP instructions and formally describes the behavior of each instruction. Since the RSP instruction set conforms to the MIPS ISA, the format and notation of this appendix is the same as Appendix A in the book “MIPS R4000 Microprocessor User’s Manual”...
Table A-1 RSP Instruction Operation Notations

Symbol               Meaning
$\leftarrow$         Assignment.
$\|$                 Bit string concatenation.
$x^{y}$              Replication of bit value x into a y-bit string. Note: x is always a single-bit value.
$x_{y \ldots z}$     Selection of bits y through z of bit string x. Little-endian bit notation is always used. If y is less than z, this expression is an empty (zero length) bit...
Table A-1 RSP Instruction Operation Notations (continued)

Symbol      Meaning
ACC[e]      Vector Unit Accumulator, element e. The ACC has 8 elements each 48 bits wide.
dmem[x]     DMEM contents beginning at byte address x.
T+i:        Indicates the time steps between operations. Each of the statements within a time step are defined to be executed in sequential order (as modified by conditional and loop constructs).
Example #1:

$$\mathrm{GPR}[rt] \leftarrow \mathit{immediate} \,\|\, 0^{16}$$

Sixteen zero bits are concatenated with an immediate value (typically 16 bits), and the 32-bit string (with the lower 16 bits set to zero) is assigned to General-Purpose Register rt.

Example #2:

$$(\mathit{immediate}_{15})^{16} \,\|\, \mathit{immediate}_{15 \ldots 0}$$

Bit 15 (the sign bit) of an immediate value is extended for...
ADD  Add

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | ADD (100000)
Format: add rd, rs, rt
Description: The contents of general register rs and the contents of general register rt are added to form the result.
ADDI  Add Immediate

Encoding: ADDI (001000) | rs | rt | immediate
Format: addi rt, rs, immediate
Description: The 16-bit immediate is sign-extended and added to the contents of general register rs to form the result. The result is placed into general register rt. Since the RSP does not signal an overflow exception for ADDI, this command behaves identically to ADDIU.
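The sign extension and the absence of an overflow exception can be modeled in a few lines. This is a Python sketch of the arithmetic only; sign_extend16 and addi are hypothetical helper names, and register values are represented as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(immediate: int) -> int:
    """Interpret the low 16 bits as a signed two's-complement value."""
    immediate &= 0xFFFF
    return immediate - 0x10000 if immediate & 0x8000 else immediate

def addi(rs_value: int, immediate: int) -> int:
    """Add the sign-extended immediate to rs; the result wraps to 32 bits."""
    return (rs_value + sign_extend16(immediate)) & MASK32
```

Because the sum simply wraps, addi(0xFFFFFFFF, 1) yields 0 rather than raising any exception, which is why ADDI and ADDIU are indistinguishable in this model.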
ADDIU  Add Immediate Unsigned

Encoding: ADDIU (001001) | rs | rt | immediate
Format: addiu rt, rs, immediate
Description: The 16-bit immediate is sign-extended and added to the contents of general register rs to form the result. The result is placed into general register rt. Since the RSP does not signal an overflow exception for ADDI, this command behaves identically to ADDI.
ADDU  Add Unsigned

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | ADDU (100001)
Format: addu rd, rs, rt
Description: The contents of general register rs and the contents of general register rt are added to form the result.
AND  And

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | AND (100100)
Format: and rd, rs, rt
Description: The contents of general register rs are combined with the contents of general register rt in a bit-wise logical AND operation.
ANDI  And Immediate

Encoding: ANDI (001100) | rs | rt | immediate
Format: andi rt, rs, immediate
Description: The 16-bit immediate is zero-extended and combined with the contents of general register rs in a bit-wise logical AND operation. The result is placed into general register rt.
Operation: GPR[rt] ...
BEQ  Branch On Equal

Encoding: BEQ (000100) | rs | rt | offset
Format: beq rs, rt, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended. The contents of general register rs and the contents of general register rt are compared.
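The branch target computation described above can be sketched in Python as follows. branch_target is a hypothetical helper; wrapping the result to 12 bits is an assumption based on IMEM being 4K bytes, and the comparison of rs and rt is left out.

```python
IMEM_MASK = 0xFFF  # assumption: the target wraps within the 4K-byte IMEM

def branch_target(branch_pc: int, offset16: int) -> int:
    """Delay-slot address plus the sign-extended offset shifted left two bits."""
    delay_slot = branch_pc + 4          # instruction following the branch
    offset = offset16 & 0xFFFF
    if offset & 0x8000:                 # sign-extend the 16-bit field
        offset -= 0x10000
    return (delay_slot + (offset << 2)) & IMEM_MASK
```

An offset of -1 (0xFFFF) therefore targets the branch instruction itself, and an offset of 0 targets the instruction after the delay slot... that is, the delay slot address plus zero.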
BGEZ  Branch On Greater Than Or Equal To Zero

Encoding: REGIMM (000001) | rs | BGEZ (00001) | offset
Format: bgez rs, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended.
BGEZAL  Branch On Greater Than Or Equal To Zero And Link

Encoding: REGIMM (000001) | rs | BGEZAL (10001) | offset
Format: bgezal rs, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended.
BGTZ  Branch On Greater Than Zero

Encoding: BGTZ (000111) | rs | 00000 | offset
Format: bgtz rs, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended.
BLEZ  Branch On Less Than Or Equal To Zero

Encoding: BLEZ (000110) | rs | 00000 | offset
Format: blez rs, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended.
BLTZ  Branch On Less Than Zero

Encoding: REGIMM (000001) | rs | BLTZ (00000) | offset
Format: bltz rs, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended.
BLTZAL  Branch On Less Than Zero And Link

Encoding: REGIMM (000001) | rs | BLTZAL (10000) | offset
Format: bltzal rs, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended.
BNE  Branch On Not Equal

Encoding: BNE (000101) | rs | rt | offset
Format: bne rs, rt, offset
Description: A branch target address is computed from the sum of the address of the instruction in the delay slot and the 16-bit offset, shifted left two bits and sign-extended.
BREAK  Breakpoint

Encoding: SPECIAL (000000) | code | BREAK (001101)
Format: break
Description: A breakpoint occurs, halting the RSP and setting the SP_STATUS_BROKE bit in the RSP status register. When the SP_STATUS_INTR_BREAK bit is set in the RSP status register, the RSP interrupt is signaled (MI_INTR_SP).
CFC2  Move Control From Coprocessor 2 (VU)

Encoding: COP2 (010010) | CF (00010) | rt | rd | 00000000000
Format: cfc2 rt, rd
Description: The contents of coprocessor 2 (VU) control register rd are loaded into general register rt.
Operation:...
CTC2  Move Control To Coprocessor 2 (VU)

Encoding: COP2 (010010) | CT (00110) | rt | rd | 00000000000
Format: ctc2 rt, rd
Description: The contents of general register rt are loaded into control register rd of the VU (coprocessor unit 2).
Operation:...
J  Jump

Encoding: J (000010) | target
Format: j target
Description: The 26-bit target address is shifted left two bits and combined with the high-order bits of the address of the delay slot. The program unconditionally jumps to this calculated address with a delay of one instruction.
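The address calculation can be sketched in Python as follows. jump_target is a hypothetical helper, and taking the "high-order bits" to be the top four bits of the delay slot address follows the usual MIPS convention, which is an assumption here.

```python
def jump_target(jump_pc: int, target26: int) -> int:
    """Combine the 26-bit target (shifted left two) with the delay slot's high bits."""
    delay_slot = (jump_pc + 4) & 0xFFFFFFFF
    high_bits = delay_slot & 0xF0000000          # assumption: top four bits
    return high_bits | ((target26 & 0x03FFFFFF) << 2)

def imem_address(address: int) -> int:
    """Only the low 12 bits matter for fetching, since IMEM is 4K bytes."""
    return address & 0xFFF
```

For example, a target field of 0x404 produces the byte address 0x1010, which falls at IMEM offset 0x010.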
JAL  Jump And Link

Encoding: JAL (000011) | target
Format: jal target
Description: The 26-bit target address is shifted left two bits and combined with the high-order bits of the address of the delay slot. The program unconditionally jumps to this calculated address with a delay of one instruction.
JALR  Jump And Link Register

Encoding: SPECIAL (000000) | rs | 00000 | rd | 00000 | JALR (001001)
Format: jalr rs
        jalr rd, rs
Description: The program unconditionally jumps to the address contained in general register rs, with a delay of
JR  Jump Register

Encoding: SPECIAL (000000) | rs | 000000000000000 | JR (001000)
Format: jr rs
Description: The program unconditionally jumps to the address contained in general register rs, with a delay of one instruction.
LB  Load Byte

Encoding: LB (100000) | base | rt | offset
Format: lb rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form a DMEM address. The contents of the byte at the DMEM location specified by the effective address are sign-extended and loaded into general register rt. Since DMEM is only 4K bytes, only the lower 12 bits of the effective address are used.
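The address wrap-around and the sign extension of the loaded byte can be modeled as below. This is a Python sketch, with DMEM represented as a 4096-byte buffer; lb is a hypothetical helper and register values are unsigned 32-bit integers.

```python
DMEM_SIZE = 0x1000  # DMEM is 4K bytes

def lb(dmem: bytes, base_value: int, offset16: int) -> int:
    """Model of a byte load: 12-bit effective address, sign-extended result."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:                       # sign-extend the 16-bit offset
        offset -= 0x10000
    address = (base_value + offset) & (DMEM_SIZE - 1)   # only 12 bits are used
    byte = dmem[address]
    value = byte - 0x100 if byte & 0x80 else byte       # sign-extend the byte
    return value & 0xFFFFFFFF
```

With base 0 and offset -1 the effective address wraps to 0xFFF, the last byte of DMEM, rather than faulting.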
LBU  Load Byte Unsigned

Encoding: LBU (100100) | base | rt | offset
Format: lbu rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form a DMEM address. The contents of the byte at the DMEM location specified by the effective address are zero-extended and loaded into general register rt. Since DMEM is only 4K bytes, only the lower 12 bits of the effective address are used.
LBV  Load Byte into Vector Register

Encoding: LWC2 (110010) | base | vt | LBV (00000) | element | offset
Format: lbv vt[element], offset(base)
Description: This instruction loads a byte (8 bits) from the effective address of DMEM into byte element of vector register vt...
LDV  Load Double into Vector Register

Encoding: LWC2 (110010) | base | vt | LDV (00011) | element | offset
Format: ldv vt[element], offset(base)
Description: This instruction loads a double (64 bits) from the effective address of DMEM into vector register vt starting at byte element. The effective address is computed by shifting the offset...
LFV  Load Packed Fourth into Vector Register

Encoding: LWC2 (110010) | base | vt | LFV (01001) | element | offset
Format: lfv vt[element], offset(base)
Description: This instruction loads every fourth byte of a 128-bit word into a VU register element. Since lfv only moves four bytes, the element field selects the upper or lower group of four destination register
Operation:

$Addr \leftarrow ((offset_{15})^{16} \,\|\, offset_{15 \ldots 0}) + GPR[base]$
for i in 0...3
    $Addr = Addr + i * 4$
    $VR[vt][element + i*2]_{15 \ldots 0} \leftarrow (0 \,\|\, dmem[Addr_{11 \ldots 0}]_{7 \ldots 0} \,\|\, 0^{7})$
endfor

Exceptions: None...
LH  Load Halfword

Encoding: LH (100001) | base | rt | offset
Format: lh rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form a DMEM address. The contents of the halfword at the DMEM location specified by the effective address are sign-extended and loaded into general register rt. Since DMEM is only 4K bytes, only the lower 12 bits of the effective address are used.
LHU  Load Halfword Unsigned

Encoding: LHU (100101) | base | rt | offset
Format: lhu rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form a DMEM address. The contents of the halfword at the DMEM location specified by the effective address are zero-extended and loaded into general register rt. Since DMEM is only 4K bytes, only the lower 12 bits of the effective address are used.
LHV  Load Packed Half into Vector Register

Encoding: LWC2 (110010) | base | vt | LHV (01000) | element | offset
Format: lhv vt[0], offset(base)
Description: This instruction loads every second byte of a 128-bit word into a VU register element. The bytes are loaded with their MSB positioned at bit 14 in the register element.
Operation:

$Addr \leftarrow ((offset_{15})^{16} \,\|\, offset_{15 \ldots 0}) + GPR[base]$
for i in 0...7
    $Addr = Addr + i * 2$
    $VR[vt][i*2]_{15 \ldots 0} \leftarrow (0 \,\|\, dmem[Addr_{11 \ldots 0}]_{7 \ldots 0} \,\|\, 0^{7})$
endfor

Exceptions: None...
LLV  Load Long into Vector Register

Encoding: LWC2 (110010) | base | vt | LLV (00010) | element | offset
Format: llv vt[element], offset(base)
Description: This instruction loads a long (32 bits) from the effective address of DMEM into vector register vt starting at byte element. The effective address is computed by shifting the offset...
LPV  Load Packed Bytes into Vector Register

Encoding: LWC2 (110010) | base | vt | LPV (00110) | element | offset
Format: lpv vt[0], offset(base)
Description: This instruction loads eight consecutive bytes into the upper bytes of eight VU register elements. See Figure 3-3, “Packed Loads and Stores,”
LQV  Load Quad into Vector Register

Encoding: LWC2 (110010) | base | vt | LQV (00100) | element | offset
Format: lqv vt[0], offset(base)
Description: This instruction loads a byte-aligned quad word (128 bits) from the effective address of DMEM up to the 128 bit boundary, that is (address) to ((address &
LRV  Load Quad (Rest) into Vector Register

Encoding: LWC2 (110010) | base | vt | LRV (00101) | element | offset
Format: lrv vt[0], offset(base)
Description: This instruction loads a byte-aligned quad word from the 128 bit aligned boundary up to the byte address, that is (address &
LSV  Load Short into Vector Register

Encoding: LWC2 (110010) | base | vt | LSV (00001) | element | offset
Format: lsv vt[element], offset(base)
Description: This instruction loads a short (16 bits) from the effective address of DMEM into vector register vt starting at byte element. The effective address is computed by shifting the offset...
LTV  Load Transpose into Vector Register

Encoding: LWC2 (110010) | base | vt | LTV (01011) | element | offset
Format: ltv vt[element], offset(base)
Description: This instruction loads an aligned 128 bit memory word into a group of 8 vector registers, scattering this memory word into a diagonal vector of shorts in 8 VU registers.
LUI  Load Upper Immediate

Encoding: LUI (001111) | 00000 | rt | immediate
Format: lui rt, immediate
Description: The 16-bit immediate is shifted left 16 bits and concatenated to 16 bits of zeros. The result is placed into general register rt.
Operation: GPR[rt] ...
LUV  Load Unsigned Packed into Vector Register

Encoding: LWC2 (110010) | base | vt | LUV (00111) | element | offset
Format: luv vt[0], offset(base)
Description: This instruction loads eight consecutive bytes into the upper bytes of eight VU register elements. The bytes are loaded with their MSB positioned at bit 14 in the register element.
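Placing a byte with its MSB at bit 14 of a 16-bit element is the same as shifting it left by seven bits. The Python sketch below shows only that placement for eight consecutive bytes; luv_elements is a hypothetical helper, and DMEM addressing and alignment are not modeled.

```python
def luv_elements(eight_bytes: bytes) -> list:
    """Return eight 16-bit element values with each byte's MSB at bit 14."""
    if len(eight_bytes) != 8:
        raise ValueError("expected exactly eight bytes")
    # Bit 7 of the byte lands on bit 14; bit 15 and bits 6...0 are zero.
    return [(b & 0xFF) << 7 for b in eight_bytes]
```

The byte 0xFF becomes 0x7F80, so the element is always non-negative when read as a signed 16-bit value, which is what makes the load "unsigned".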
LW  Load Word

Encoding: LW (100011) | base | rt | offset
Format: lw rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form a DMEM address. The contents of the word at the DMEM location specified by the effective address are loaded into general register rt. Since DMEM is only 4K bytes, only the lower 12 bits of the effective address are used.
MFC0  Move From System Control Coprocessor

Encoding: COP0 (010000) | MF (00000) | rt | rd | 00000000000
Format: mfc0 rt, rd
Description: The contents of coprocessor register rd of the CP0 are loaded into general register rt.
Operation: data ...
MFC2  Move From Coprocessor 2 (VU)

Encoding: COP2 (010010) | MF (00000) | rt | vd | e | 0000000
Format: mfc2 rt, vd[e]
Description: The 16-bit contents at byte element e of VU register vd are sign-extended and loaded into general register rt.
Operation:...
MTC0  Move To System Control Coprocessor

Encoding: COP0 (010000) | MT (00100) | rt | rd | 00000000000
Format: mtc0 rt, rd
Description: The contents of general register rt are loaded into coprocessor register rd of CP0.
MTC2  Move To Coprocessor 2 (VU)

Encoding: COP2 (010010) | MT (00100) | rt | vd | e | 0000000
Format: mtc2 rt, vd[e]
Description: The least significant 16 bits of general register rt are loaded at byte element e of VU register vd.
Operation:...
NOP  Null Operation

Encoding: SPECIAL (000000) with all other fields zero
Format: nop
Description: This instruction does nothing; it modifies no registers and changes no internal RSP state. It is useful for program instruction padding or insertion into branch delay slots (when no useful work can be done).
NOR  Nor

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | NOR (100111)
Format: nor rd, rs, rt
Description: The contents of general register rs are combined with the contents of general register rt in a bit-wise logical NOR operation.
OR  Or

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | OR (100101)
Format: or rd, rs, rt
Description: The contents of general register rs are combined with the contents of general register rt in a bit-wise logical OR operation.
ORI  Or Immediate

Encoding: ORI (001101) | rs | rt | immediate
Format: ori rt, rs, immediate
Description: The 16-bit immediate is zero-extended and combined with the contents of general register rs in a bit-wise logical OR operation. The result is placed into general register rt.
Operation: GPR[rt] ...
SB  Store Byte

Encoding: SB (101000) | base | rt | offset
Format: sb rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form a DMEM address. The least-significant byte of register rt is stored at the DMEM address.
SBV  Store Byte from Vector Register

Encoding: SWC2 (111010) | base | vt | SBV (00000) | element | offset
Format: sbv vt[element], offset(base)
Description: This instruction stores a byte from a vector register into DMEM. The effective address is computed by adding the offset to the contents of the base register (a SU GPR).
SDV  Store Double from Vector Register

Encoding: SWC2 (111010) | base | vt | SDV (00011) | element | offset
Format: sdv vt[element], offset(base)
Description: This instruction stores a double word (64 bits) from a vector register into DMEM. The effective address is computed by adding the offset to the contents of the base...
SFV  Store Packed Fourth from Vector Register

Encoding: SWC2 (111010) | base | vt | SFV (01001) | element | offset
Format: sfv vt[element], offset(base)
Description: This instruction stores a byte from each of four VU register elements, to every fourth byte of a 128-bit word in DMEM.
SH  Store Halfword

Encoding: SH (101001) | base | rt | offset
Format: sh rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form an unsigned DMEM address. The least-significant halfword of register rt is stored at the DMEM address.
SHV  Store Packed Half from Vector Register

Encoding: SWC2 (111010) | base | vt | SHV (01000) | element | offset
Format: shv vt[0], offset(base)
Description: This instruction stores a byte from each of eight VU register elements, to every second byte of a 128-bit word in DMEM.
SLL  Shift Left Logical

Encoding: SPECIAL (000000) | 00000 | rt | rd | sa | SLL (000000)
Format: sll rd, rt, sa
Description: The contents of general register rt are shifted left by sa bits, inserting zeros into the low-order bits.
SLLV  Shift Left Logical Variable

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | SLLV (000100)
Format: sllv rd, rt, rs
Description: The contents of general register rt are shifted left the number of bits specified by the low-order five bits contained in general register rs, inserting zeros into the low-order bits.
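Because only the low-order five bits of the shift count are used, a count of 33 behaves like a count of 1. A minimal Python sketch of this behavior, with sllv as a hypothetical helper operating on unsigned 32-bit values:

```python
MASK32 = 0xFFFFFFFF

def sllv(rt_value: int, rs_value: int) -> int:
    """Shift rt left by the low five bits of rs, filling with zeros."""
    shift = rs_value & 0x1F            # only bits 4...0 of rs are examined
    return (rt_value << shift) & MASK32
```

Bits shifted past bit 31 are discarded, so sllv(0x80000000, 1) is 0.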
SLT  Set On Less Than

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | SLT (101010)
Format: slt rd, rs, rt
Description: The contents of general register rt are subtracted from the contents of general register rs. Considering both quantities as signed integers, if the contents of general register rs are less than the contents of general register rt...
SLTI  Set On Less Than Immediate

Encoding: SLTI (001010) | rs | rt | immediate
Format: slti rt, rs, immediate
Description: The 16-bit immediate is sign-extended and subtracted from the contents of general register rs. Considering both quantities as signed integers, if rs is less than the sign-extended immediate, the result is set to one;
SLTIU  Set On Less Than Immediate Unsigned

Encoding: SLTIU (001011) | rs | rt | immediate
Format: sltiu rt, rs, immediate
Description: The 16-bit immediate is sign-extended and subtracted from the contents of general register rs. Considering both quantities as unsigned integers, if rs is less than the sign-extended immediate, the result is set to one;
SLTU  Set On Less Than Unsigned

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | SLTU (101011)
Format: sltu rd, rs, rt
Description: The contents of general register rt are subtracted from the contents of general register rs. Considering both quantities as unsigned integers, if the contents of general register rs are less than the contents of general register rt
SLV  Store Long from Vector Register

Encoding: SWC2 (111010) | base | vt | SLV (00010) | element | offset
Format: slv vt[element], offset(base)
Description: This instruction stores a long word (32 bits) from vector register into DMEM. The effective address is computed by adding the offset to the contents of the base...
SPV  Store Packed Bytes from Vector Register

Encoding: SWC2 (111010) | base | vt | SPV (00110) | element | offset
Format: spv vt[0], offset(base)
Description: This instruction stores the upper byte from each of eight VU register elements, to consecutive bytes of a 128-bit word in DMEM.
SQV  Store Quad from Vector Register

Encoding: SWC2 (111010) | base | vt | SQV (00100) | element | offset
Format: sqv vt[0], offset(base)
Description: This instruction stores a vector register starting at byte element 0 up to byte (address & 15), to a byte-aligned quad word (128 bits) at the effective address of DMEM up to the 128 bit boundary, that is (address) to ((address &
SRA  Shift Right Arithmetic

Encoding: SPECIAL (000000) | 00000 | rt | rd | sa | SRA (000011)
Format: sra rd, rt, sa
Description: The contents of general register rt are shifted right by sa bits, sign-extending the high-order bits. The result is placed in register rd.
Operation: GPR[rd] ...
SRAV  Shift Right Arithmetic Variable

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | SRAV (000111)
Format: srav rd, rt, rs
Description: The contents of general register rt are shifted right by the number of bits specified by the low-order five bits of general register rs, sign-extending the high-order bits.
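The difference between an arithmetic shift (sign-extending) and a logical shift (zero-filling) is easiest to see side by side. The following Python sketch models both variable shifts on unsigned 32-bit register values; srav and srlv are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF

def srav(rt_value: int, rs_value: int) -> int:
    """Arithmetic right shift: the sign bit is copied into the vacated bits."""
    shift = rs_value & 0x1F
    rt_value &= MASK32
    signed = rt_value - (1 << 32) if rt_value & 0x80000000 else rt_value
    return (signed >> shift) & MASK32

def srlv(rt_value: int, rs_value: int) -> int:
    """Logical right shift: zeros are inserted into the vacated bits."""
    return (rt_value & MASK32) >> (rs_value & 0x1F)
```

Shifting 0x80000000 right by four gives 0xF8000000 arithmetically but 0x08000000 logically; for values with the sign bit clear the two agree.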
SRL  Shift Right Logical

Encoding: SPECIAL (000000) | 00000 | rt | rd | sa | SRL (000010)
Format: srl rd, rt, sa
Description: The contents of general register rt are shifted right by sa bits, inserting zeros into the high-order bits.
SRLV  Shift Right Logical Variable

Encoding: SPECIAL (000000) | rs | rt | rd | 00000 | SRLV (000110)
Format: srlv rd, rt, rs
Description: The contents of general register rt are shifted right by the number of bits specified by the low-order five bits of general register rs, inserting zeros into the high-order bits.
Page 224
Store Quad (Rest) from Vector Register
Encoding: SWC2 = 111010, operation = 00101; fields: base, element, offset
Format: srv vt[e], offset(base)
Description: This instruction stores a vector register from byte element (16 - (address & 15)) to 15, to the 128 bit aligned boundary up to the byte address, that is (address &...
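The byte ranges described for sqv and srv complement each other, which the following Python sketch tries to make concrete. It is a simplified model under stated assumptions: DMEM is a plain bytearray, the register is 16 bytes starting at byte element 0, addresses do not wrap, and the function names are hypothetical.

```python
def sqv(dmem, vreg, address):
    """Store from byte element 0 up to the next 128-bit boundary."""
    offset = address & 15
    count = 16 - offset  # bytes remaining before the boundary
    dmem[address:address + count] = vreg[:count]


def srv(dmem, vreg, address):
    """Store byte elements (16 - (address & 15))..15 from the aligned
    boundary up to, but not including, the byte address."""
    offset = address & 15
    base = address & ~15
    dmem[base:base + offset] = vreg[16 - offset:]


dmem = bytearray(64)
vreg = bytes(range(16))
sqv(dmem, vreg, 5)        # writes vreg[0:11] to DMEM bytes 5..15
srv(dmem, vreg, 5 + 16)   # writes vreg[11:16] to DMEM bytes 16..20
```

Under these assumptions, an sqv at an address followed by an srv at that address plus 16 stores the whole register to an unaligned location.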
Page 225
Store Short from Vector Register
Encoding: SWC2 = 111010, operation = 00001; fields: base, element, offset
Format: ssv vt[element], offset(base)
Description: This instruction stores a half word (16 bits) from a vector register into DMEM. The effective address is computed by adding the...
Page 226
Store Transpose from Vector Register
Encoding: SWC2 = 111010, operation = 01011; fields: base, element, offset
Format: stv vt[element], offset(base)
Description: This instruction gathers a diagonal vector of shorts from a group of eight VU registers, writing to an aligned 128 bit memory word.
Page 227
Subtract
Encoding: SPECIAL = 000000, zero field = 00000, function = 100010
Format: sub rd, rs, rt
Description: The contents of general register rt are subtracted from the contents of general register rs to form a result.
Page 228
Subtract Unsigned
Encoding: SPECIAL = 000000, zero field = 00000, SUBU = 100011
Format: subu rd, rs, rt
Description: The contents of general register rt are subtracted from the contents of general register rs to form a result.
Page 229
Store Unsigned Packed from Vector Register
Encoding: SWC2 = 111010, operation = 00111; fields: base, element, offset
Format: suv vt[0], offset(base)
Description: This instruction stores eight consecutive bytes in DMEM, extracted from the upper bytes of eight VU register elements.
Page 230
Store Word
Encoding: opcode = 101011; fields: offset, base
Format: sw rt, offset(base)
Description: The 16-bit offset is sign-extended and added to the contents of general register base to form a DMEM address. The contents of general register rt are stored at the DMEM location specified by the DMEM address.
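The address calculation for sw can be sketched as follows. This is a minimal model, not the hardware definition: it assumes a 4 KB DMEM that wraps, big-endian byte order, and a register file held in a Python list. The helper names and the `DMEM_SIZE` constant are illustrative.

```python
DMEM_SIZE = 0x1000  # assumption: 4 KB of DMEM, addresses wrap


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed quantity."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def sw(dmem, gpr, rt, offset, base):
    """Store GPR[rt] at GPR[base] + sign-extended offset; return the address."""
    address = (gpr[base] + sign_extend16(offset)) % DMEM_SIZE
    word = gpr[rt] & 0xFFFFFFFF
    # Assumed big-endian: most significant byte at the lowest address.
    for i, byte in enumerate(word.to_bytes(4, "big")):
        dmem[(address + i) % DMEM_SIZE] = byte
    return address


dmem = bytearray(DMEM_SIZE)
gpr = [0] * 32
gpr[1] = 0x100
gpr[2] = 0xDEADBEEF
addr = sw(dmem, gpr, rt=2, offset=0xFFFC, base=1)  # offset encodes -4
```

Here the offset 0xFFFC sign-extends to -4, so the word lands at address 0xFC.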
Page 231
Store Wrapped from Vector Register
Encoding: SWC2 = 111010, operation = 00111; fields: base, element, offset
Format: swv vt[element], offset(base)
Description: This instruction gathers a diagonal vector of shorts from a group of eight VU registers, writing to an aligned 128 bit memory word.
Page 232
Vector Absolute Value VABS VABS of Short Elements COP2 VABS 0 1 0 0 1 0 0 1 0 0 1 1 Format: vabs vd, vs, vt vabs vd, vs, vt[e] Description: The 16-bit elements of vector register are conditionally negated on an element-by-element basis by the sign of the elements of vector register and placed into vector register .
Page 233
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 234
Vector Add of Short Elements
Encoding: COP2 = 010010, VADD = 010000
Format: vadd vd, vs, vt
        vadd vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are added on an element-by-element basis to the elements of vector register vt.
Page 235
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
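The element-selection step that opens each vector Operation block maps a result index i to a source index j in vt. A small Python sketch of the two cases visible in the pseudocode above follows. The function name is hypothetical, and because the remaining branches are cut off in this text, the sketch raises an error for them rather than guessing.

```python
def select_element(e, i):
    """Map result element i to source element j for element specifier e.

    Only the "vector operand" and "scalar quarter" cases shown in the
    pseudocode are modelled; other encodings are not reproduced here.
    """
    e &= 0b1111
    if e == 0b0000:                  # vector operand
        return i
    if (e & 0b1110) == 0b0010:       # scalar quarter of vector
        return (e & 0b0001) + (i & 0b1110)
    raise NotImplementedError("encoding not covered by this sketch")


quarter_even = [select_element(0b0010, i) for i in range(8)]
quarter_odd = [select_element(0b0011, i) for i in range(8)]
```

With e = 0010 each pair of result elements reads the even member of the pair in vt, and with e = 0011 the odd member.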
Page 236
Vector Add Short Elements With Carry
Encoding: COP2 = 010010, VADDC = 010100
Format: vaddc vd, vs, vt
        vaddc vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are added on an element-by-element basis to the elements of vector register vt.
Page 237
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 238
Vector AND of Short Elements
Encoding: COP2 = 010010, VAND = 101000
Format: vand vd, vs, vt
        vand vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are AND’d on an element-by-element basis with the elements of vector register vt. The results are placed into vector register vd. If an element specification...
Page 239
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 240
Vector Select Clip Test High
Encoding: COP2 = 010010, function = 100101
Format: vch vd, vs, vt
        vch vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are compared and selected on an element-by-element basis with the elements of vector register vt.
Page 241
Revision 1.0 Operation: 0 16 15...0 0 16 15...0 0 8 7...0 for i in 0...7 if (e = 0000) then /* vector operand */ j i elseif ((e & 1110) = 0010) then /* scalar quarter of vector */ 3...0 j ...
Page 242
di VR[vd][i*2] 15...0 15...0 neq ~eq and 1 VCC or (ge << (i + 8)) or (le << i) 15...0 15...0 VCO or (neq << (i + 8)) or (sign << i) 15...0 15...0 VCE or (vce <<...
Page 243
Vector Select Clip Test Low
Encoding: COP2 = 010010, function = 100100
Format: vcl vd, vs, vt
        vcl vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are compared and selected on an element-by-element basis with the elements of vector register vt.
Page 244
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 245
Revision 1.0 VR[vs][i*2] - VR[vt][j*2] 15...0 15...0 15...0 if (eq) then ge (di >= 0) 15...0 endif (ge) ? VR[vt][j*2] : VR[vs][i*2] 15...0 15...0 15...0 di ACC[i] 15...0 15...0 endif di VR[vd][i*2] 15...0 15...0 VCC and (~(1 || 0 || 1) <<...
Page 246
Vector Select Crimp Test Low
Encoding: COP2 = 010010, function = 100110
Format: vcr vd, vs, vt
        vcr vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are compared and selected on an element-by-element basis with the elements of vector register vt.
Page 247
Revision 1.0 Operation: 0 16 15...0 for i in 0...7 if (e = 0000) then /* vector operand */ j i elseif ((e & 1110) = 0010) then /* scalar quarter of vector */ 3...0 j (e &...
Page 249
Vector Select Equal
Encoding: COP2 = 010010, function = 100001
Format: veq vd, vs, vt
        veq vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are compared and selected on an element-by-element basis with the elements of vector register vt.
Page 250
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 252
Vector Select Greater Than or Equal
Encoding: COP2 = 010010, function = 100011
Format: vge vd, vs, vt
        vge vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are compared and selected on an element-by-element basis with the elements of vector register vt.
Page 253
Operation:
VCC ← 0
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e &...
Page 255
Vector Select Less Than
Encoding: COP2 = 010010, function = 100000
Format: vlt vd, vs, vt
        vlt vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are compared and selected on an element-by-element basis with the elements of vector register vt.
Page 256
Operation:
VCC ← 0
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 258
Vector Multiply-Accumulate of Signed Fractions
Encoding: COP2 = 010010, VMACF = 001000
Format: vmacf vd, vs, vt
        vmacf vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and added to bits 47...16 of the accumulator.
Page 259
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 260
Vector Accumulator Oddification
Encoding: COP2 = 010010, VMACQ = 001011
Format: vmacq vd, vs, vt
        vmacq vd, vs, vt[e]
Description: This instruction ignores inputs, and performs oddification of the accumulator by adding (32 <<...
Page 261
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 262
Vector Multiply-Accumulate of Unsigned Fractions
Encoding: COP2 = 010010, VMACU = 001001
Format: vmacu vd, vs, vt
        vmacu vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and added to bits 47...16 of the accumulator.
Page 263
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 264
Vector Multiply-Accumulate of High Partial Products
Encoding: COP2 = 010010, VMADH = 001111
Format: vmadh vd, vs, vt
        vmadh vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, shifted up by 16, and added to bits 31...0 of the accumulator.
Page 265
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 266
Vector Multiply-Accumulate of Low Partial Products
Encoding: COP2 = 010010, VMADL = 001100
Format: vmadl vd, vs, vt
        vmadl vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, shifted down by 16, and added to bits 31...0 of the accumulator.
Page 267
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 268
Vector Multiply-Accumulate of Mid Partial Products
Encoding: COP2 = 010010, VMADM = 001101
Format: vmadm vd, vs, vt
        vmadm vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and added to bits 31...0 of the accumulator.
Page 269
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 270
Vector Multiply-Accumulate of Mid Partial Products
Encoding: COP2 = 010010, VMADN = 001110
Format: vmadn vd, vs, vt
        vmadn vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and added to bits 31...0 of the accumulator.
Page 271
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 272
Vector Element Scalar Move
Encoding: COP2 = 010010, VMOV = 110011
Format: vmov vd[de], vt[e]
Description: The scalar 16-bit element of vector register vt is moved to the scalar 16-bit element of vector register vd.
Operation: ...
Page 273
Vector Select Merge
Encoding: COP2 = 010010, VMRG = 100111
Format: vmrg vd, vs, vt
        vmrg vd, vs, vt[e]
Description: This instruction selects, on an element-by-element basis, an element from vs or vt, based on the value of VCC for that element.
Page 274
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
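The per-element selection performed by vmrg can be sketched in Python as below. This is an illustration under explicit assumptions, since the full Operation text is truncated here: it assumes that bit i of VCC set selects the element from vs and clear selects the element from vt, and it models only the plain vector operand form (no element broadcast). The function name is hypothetical.

```python
def vmrg(vs, vt, vcc):
    """Merge two 8-element vectors under control of the low 8 bits of VCC.

    Assumption: VCC bit i = 1 picks vs[i], VCC bit i = 0 picks vt[i].
    """
    if len(vs) != 8 or len(vt) != 8:
        raise ValueError("vectors must have eight elements")
    return [vs[i] if (vcc >> i) & 1 else vt[i] for i in range(8)]


merged = vmrg([1] * 8, [2] * 8, 0b00001111)
```

A typical use would follow a compare instruction that sets VCC, with vmrg then choosing between two candidate results per element.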
Page 275
Vector Multiply of High Partial Products
Encoding: COP2 = 010010, VMUDH = 000111
Format: vmudh vd, vs, vt
        vmudh vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, shifted up by 16, and loaded into the accumulator.
Page 276
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 277
Vector Multiply of Low Partial Products
Encoding: COP2 = 010010, VMUDL = 000100
Format: vmudl vd, vs, vt
        vmudl vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, shifted down by 16, and loaded into the accumulator.
Page 278
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 279
Vector Multiply of Mid Partial Products
Encoding: COP2 = 010010, VMUDM = 000101
Format: vmudm vd, vs, vt
        vmudm vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and loaded into the accumulator.
Page 280
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 281
Vector Multiply of Mid Partial Products
Encoding: COP2 = 010010, VMUDN = 000110
Format: vmudn vd, vs, vt
        vmudn vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and loaded into the accumulator.
Page 282
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 283
Vector Multiply of Signed Fractions
Encoding: COP2 = 010010, VMULF = 000000
Format: vmulf vd, vs, vt
        vmulf vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and loaded into the accumulator.
Page 284
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 285
Vector Multiply MPEG Quantization
Encoding: COP2 = 010010, VMULQ = 000011
Format: vmulq vd, vs, vt
        vmulq vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and loaded into the accumulator.
Page 286
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 287
Vector Multiply of Unsigned Fractions
Encoding: COP2 = 010010, VMULU = 000001
Format: vmulu vd, vs, vt
        vmulu vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are multiplied on an element-by-element basis to the elements of vector register vt, and loaded into the accumulator.
Page 288
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 289
Vector NAND of Short Elements
Encoding: COP2 = 010010, VNAND = 101001
Format: vnand vd, vs, vt
        vnand vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are NAND’d on an element-by-element basis with the elements of vector register vt. The results are placed into vector register vd. If an element specification...
Page 290
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 291
Vector Select Not Equal
Encoding: COP2 = 010010, function = 100010
Format: vne vd, vs, vt
        vne vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are compared and selected on an element-by-element basis with the elements of vector register vt.
Page 292
Operation:
VCC ← 0
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 294
Vector Null Instruction
Encoding: COP2 = 010010, VNOP = 110111
Format: vnop
Description: This instruction does nothing; it modifies no registers and changes no internal RSP state. It is useful for program instruction padding or insertion into branch delay slots (when no useful work can be done).
Page 295
Vector NOR of Short Elements
Encoding: COP2 = 010010, VNOR = 101011
Format: vnor vd, vs, vt
        vnor vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are NOR’d on an element-by-element basis with the elements of vector register vt. The results are placed into vector register vd. If an element specification...
Page 296
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 297
Vector NXOR of Short Elements
Encoding: COP2 = 010010, VNXOR = 101101
Format: vnxor vd, vs, vt
        vnxor vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are NXOR’d on an element-by-element basis with the elements of vector register vt. The results are placed into vector register vd. If an element specification...
Page 298
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 299
Vector OR of Short Elements
Encoding: COP2 = 010010, VOR = 101010
Format: vor vd, vs, vt
        vor vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are OR’d on an element-by-element basis with the elements of vector register vt. The results are placed into vector register vd. If an element specification...
Page 300
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 301
Vector Element Scalar Reciprocal (Single Precision)
Encoding: COP2 = 010010, VRCP = 110000
Format: vrcp vd[de], vt[e]
Description: The 32-bit reciprocal of the scalar 16-bit element of vector register vt is calculated and the lower 16 bits are stored in the scalar 16-bit element of vector register...
Page 303
Vector Element Scalar Reciprocal (Double Prec. High)
Encoding: COP2 = 010010, VRCPH = 110010
Format: vrcph vd[de], vt[e]
Description: The upper 16 bits of the reciprocal previously calculated are stored in the scalar 16-bit element of vector register vd.
Page 304
Vector Element Scalar Reciprocal (Double Prec. Low)
Encoding: COP2 = 010010, VRCPL = 110001
Format: vrcpl vd[de], vt[e]
Description: The 16-bit element of vector register vt is used as the lower 16 bits of a double-precision reciprocal calculation (combined with data previously loaded by vrcph).
Page 305
Revision 1.0 DivIn addr 15...0 (31-lshift)...(31-lshift-9) rcpRom[addr romData ] 15...0 15...0 0 || 1 || romData 14 result || 0 31...0 15...0 rshift ~lshift and 1 5 0 rshift result || result 31...0 31...(32-rshift) if (VR[vt][e] <...
Page 306
Vector Accumulator VRNDN VRNDN DCT Rounding (Negative) COP2 VRNDN 0 1 0 0 1 0 0 0 1 0 1 0 Format: vrndn vd, vs, vt vrndn vd, vs, vt[e] Description: This instruction is specifically designed to support MPEG DCT rounding. The vector register is shifted left 16 bits if the field is 1 (not the contents of...
Page 307
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 308
Vector Accumulator VRNDP VRNDP DCT Rounding (Positive) COP2 VRNDP 0 1 0 0 1 0 0 0 0 0 1 0 Format: vrndp vd, vs, vt vrndp vd, vs, vt[e] Description: This instruction is specifically designed to support MPEG DCT rounding. The vector register is shifted left 16 bits if the field is 1 (not the contents of...
Page 309
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 310
Vector Element Scalar SQRT Reciprocal
Encoding: COP2 = 010010, VRSQ = 110100
Format: vrsq vd[de], vt[e]
Description: The 32-bit reciprocal of the square root of the scalar 16-bit element of vector register vt is calculated and the lower 16 bits are stored in the scalar 16-bit element of vector register vd.
Operation:...
Page 311
Revision 1.0 if (DivIn ) then 31...0 lshift 16 endif DivIn addr 15...0 (31-lshift)...(31-lshift-9) (addr addr or (0 || 1 || 0 )) and (0 || 1 || 0) or (lshift mod 2) 15...0 15...0 rsqRom[addr romData ]...
Page 312
Vector Element Scalar SQRT Reciprocal (Double Prec. High)
Encoding: COP2 = 010010, VRSQH = 110110
Format: vrsqh vd[de], vt[e]
Description: The upper 16 bits of the reciprocal of the square root previously calculated are stored in the scalar 16-bit element of vector register vd.
Page 313
Vector Element Scalar SQRT Reciprocal (Double Prec. Low)
Encoding: COP2 = 010010, VRSQL = 110101
Format: vrsql vd[de], vt[e]
Description: The 16-bit element of vector register vt is used as the lower 16 bits of a double-precision square root reciprocal calculation (combined with data previously loaded by vrsqh).
Page 314
DivIn addr 15...0 (31-lshift)...(31-lshift-9) (addr addr or (0 || 1 || 0 )) and (0 || 1 || 0) or (lshift mod 2) 15...0 15...0 rsqRom[addr romData ] 15...0 15...0 0 || 1 || romData 14 result || 0 31...0...
Page 315
Vector Accumulator Read (and Write)
Encoding: COP2 = 010010, VSAR = 011101
Format: vsar vd, vs, vt[e]
Description: The upper, middle, or low 16-bit portion of the accumulator elements is selected by e and read out to the elements of vd. The elements of...
Page 317
Vector Subtraction of Short Elements
Encoding: COP2 = 010010, VSUB = 010001
Format: vsub vd, vs, vt
        vsub vd, vs, vt[e]
Description: The 16-bit elements of vector register vt are subtracted on an element-by-element basis from the elements of vector register vs.
Page 318
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
Page 319
Vector Subtraction of Short Elements With Carry
Encoding: COP2 = 010010, VSUBC = 010101
Format: vsubc vd, vs, vt
        vsubc vd, vs, vt[e]
Description: The 16-bit elements of vector register vt are subtracted on an element-by-element basis from the elements of vector register vs.
Page 320
Operation: 0 16 15...0 for i in 0...7 if (e = 0000) then /* vector operand */ 3...0 j i elseif ((e & 1110) = 0010) then /* scalar quarter of vector */ 3...0 j (e &...
Page 321
Vector XOR of Short Elements
Encoding: COP2 = 010010, VXOR = 101100
Format: vxor vd, vs, vt
        vxor vd, vs, vt[e]
Description: The 16-bit elements of vector register vs are XOR’d on an element-by-element basis with the elements of vector register vt. The results are placed into vector register vd. If an element specification...
Page 322
Operation:
for i in 0...7
    if (e[3...0] = 0000) then /* vector operand */
        j ← i
    elseif ((e[3...0] & 1110) = 0010) then /* scalar quarter of vector */
        j ← (e[3...0] & 0001) + (i & 1110)
    elseif ((e &...
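The six element-wise logical instructions (vand, vnand, vor, vnor, vxor, vnxor) differ only in the bit-wise operator applied to each pair of 16-bit elements. The Python sketch below models the plain vector operand form; the function name and the dictionary keyed by mnemonic are illustrative, and the element-broadcast forms are not modelled.

```python
MASK16 = 0xFFFF

# One bit-wise operator per mnemonic; results are masked to 16 bits below.
_LOGICAL_OPS = {
    "vand": lambda a, b: a & b,
    "vnand": lambda a, b: ~(a & b),
    "vor": lambda a, b: a | b,
    "vnor": lambda a, b: ~(a | b),
    "vxor": lambda a, b: a ^ b,
    "vnxor": lambda a, b: ~(a ^ b),
}


def vector_logical(mnemonic, vs, vt):
    """Apply the named logical operation to eight pairs of 16-bit elements."""
    if len(vs) != 8 or len(vt) != 8:
        raise ValueError("vectors must have eight elements")
    op = _LOGICAL_OPS[mnemonic]
    return [op(a & MASK16, b & MASK16) & MASK16 for a, b in zip(vs, vt)]


vs = [0xFF00] * 8
vt = [0x0FF0] * 8
and_result = vector_logical("vand", vs, vt)
nxor_result = vector_logical("vnxor", vs, vt)
```

Masking after the inversion keeps Python's unbounded integers within the 16-bit element width.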
Page 323
Exclusive Or
Encoding: SPECIAL = 000000, zero field = 00000, function = 100110
Format: xor rd, rs, rt
Description: The contents of general register rs are combined with the contents of general register rt in a bit-wise logical exclusive OR operation.
Page 324
Exclusive OR Immediate
Encoding: XORI = 001110; field: immediate
Format: xori rt, rs, immediate
Description: The 16-bit immediate is zero-extended and combined with the contents of general register rs in a bit-wise logical exclusive OR operation. The result is placed into general register rt.
Operation: GPR[rt] ...
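Because the immediate is zero-extended rather than sign-extended, xori can only change the low 16 bits of the result. The short Python sketch below illustrates this with a register file held in a list; the function name and the list representation are illustrative assumptions.

```python
def xori(gpr, rt, rs, immediate):
    """GPR[rt] = GPR[rs] XOR zero-extended 16-bit immediate."""
    gpr[rt] = (gpr[rs] ^ (immediate & 0xFFFF)) & 0xFFFFFFFF


gpr = [0] * 32
gpr[1] = 0xFFFF0000
xori(gpr, rt=2, rs=1, immediate=0x8001)
```

Here the upper half of register 2 keeps the value 0xFFFF from register 1, even though bit 15 of the immediate is set.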