SPRUGH7
Reference Guide
Release History
ø-ii TMS320C66x DSP CPU and Instruction Set Reference Guide SPRUGH7—November 2010
www.ti.com
Contents
Release History. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ø-ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ø-xvii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ø-xx
List of Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ø-xxiv
Preface ø-xxv
About This Manual. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ø-xxv
Notational Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ø-xxv
Related Documentation from Texas Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ø-xxvi
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ø-xxvi
Chapter 1
Introduction 1-1
1.1 DSP Features and Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.1.1 4x Multiply. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
1.1.2 Floating point support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
1.1.3 Vector Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
1.1.4 Complex arithmetic and matrix operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
1.2 DSP Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
1.2.1 Central Processing Unit (CPU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
1.2.2 Internal Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
1.2.3 Memory and Peripheral Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Pipeline 5-1
5.1 Pipeline Operation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
5.1.1 Fetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
5.1.2 Decode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
5.1.3 Execute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
5.1.4 Pipeline Operation Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
5.2 Pipeline Execution of Instruction Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-9
5.2.1 Single-Cycle Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-11
5.2.2 Two-Cycle Instructions and .M Unit Nonmultiply Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-12
5.2.3 Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
5.2.4 Extended Multiply Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
5.2.5 Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
5.2.6 Branch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-16
5.2.7 Two-Cycle DP Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-17
5.2.8 Three-Cycle Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.2.9 Four-Cycle Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.2.10 INTDP Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-19
5.2.11 Double-Precision (DP) Compare Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-19
5.2.12 ADDDP/SUBDP Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.2.13 MPYI Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
5.2.14 MPYID Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
5.2.15 MPYDP Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
5.2.16 MPYSPDP Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-22
5.2.17 MPYSP2DP Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-23
5.3 Functional Unit Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-24
5.3.1 .S-Unit Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-24
5.3.2 .M-Unit Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-28
Chapter 6
Interrupts 6-1
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.1.1 Types of Interrupts and Signals Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.1.1.1 Reset (RESET) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.1.1.2 Nonmaskable Interrupt (NMI). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.1.1.3 Maskable Interrupts (INT4-INT15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
6.1.1.4 Interrupt Service Table (IST) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
6.1.1.5 Interrupt Service Fetch Packet (ISFP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.1.1.6 Interrupt Service Table Pointer (ISTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
6.1.2 Summary of Interrupt Control Registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
6.2 Globally Enabling and Disabling Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-10
6.3 Individual Interrupt Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-13
6.3.1 Enabling and Disabling Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-13
6.3.2 Status of Interrupts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-13
6.3.3 Setting and Clearing Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-14
6.3.4 Returning From Interrupt Servicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-14
6.3.4.1 CPU State After RESET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-14
6.3.4.2 Returning From Nonmaskable Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-15
6.3.4.3 Returning From Maskable Interrupts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-15
6.4 Interrupt Detection and Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-16
6.4.1 Setting the Nonreset Interrupt Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-17
6.4.1.1 Detection of Missed Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-18
6.4.2 Conditions for Processing a Nonreset Interrupt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-18
6.4.3 Saving TSR Context in Nonreset Interrupt Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-19
6.4.4 Actions Taken During Nonreset Interrupt Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-20
6.4.5 Conditions for Processing a Nonmaskable Interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-20
6.4.6 Saving of Context in Nonmaskable Interrupt Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-23
6.4.7 Actions Taken During Nonmaskable Interrupt Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-23
6.4.8 Setting the RESET Interrupt Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-23
6.4.9 Actions Taken During RESET Interrupt Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-24
6.5 Performance Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-25
6.5.1 General Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-25
6.5.2 Pipeline Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-25
6.6 Programming Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-26
6.6.1 Single Assignment Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-26
6.6.2 Nested Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-27
6.6.3 Manual Interrupt Processing (polling). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-28
6.6.4 Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-28
Chapter 7
Chapter 8
Chapter 9
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
List of Tables
Table 1-1 Raw Performance Comparison Between the C674x and the C66x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
Table 2-1 64-bit Register Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
Table 2-2 128-bit Register Quadruplets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
Table 2-3 Modulo 2 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
Table 2-4 Modulo 5 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
Table 2-5 Modulo Arithmetic for Field GF(2^3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9
Table 2-6 Control Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
Table 2-7 Addressing Mode Register (AMR) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
Table 2-8 Block Size Calculations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13
Table 2-9 Control Status Register (CSR) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
Table 2-10 Galois Field Polynomial Generator Function Register (GFPGFR) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
Table 2-11 Interrupt Clear Register (ICR) Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
Table 2-12 Interrupt Enable Register (IER) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Table 2-13 Interrupt Flag Register (IFR) Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Table 2-14 Interrupt Set Register (ISR) Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
Table 2-15 Interrupt Service Table Pointer Register (ISTP) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-22
Table 2-16 Control Register File Extensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
Table 2-17 Exception Flag Register (EFR) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
Table 2-18 Internal Exception Report Register (IERR) Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
Table 2-19 NMI/Exception Task State Register (NTSR) Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29
Table 2-20 Saturation Status Register Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-30
Table 2-21 Task State Register (TSR) Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33
Table 2-22 Control Register File Extensions for Floating-Point Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-34
Table 2-23 Floating-Point Adder Configuration Register (FADCR) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35
Table 2-24 Floating-Point Auxiliary Configuration Register (FAUCR) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-38
Table 2-25 Floating-Point Multiplier Configuration Register (FMCR) Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-40
Table 3-1 Instruction Operation and Execution Notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Table 3-2 Instruction Syntax and Opcode Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
Table 3-3 IEEE Floating-Point Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Table 3-4 Special Single-Precision Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Table 3-5 Hexadecimal and Decimal Representation for Selected Single-Precision Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Table 3-6 Special Double-Precision Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Table 3-7 Hexadecimal and Decimal Representation for Selected Double-Precision Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Table 3-8 Delay Slot and Functional Unit Latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Table 3-9 Registers That Can Be Tested by Conditional Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Table 3-10 Indirect Address Generation for Load/Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
Table 3-11 Address Generator Options for Load/Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
Table 3-12 Layout Field Description in Compact Instruction Packet Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-30
Table 3-13 Expansion Field Description in Compact Instruction Packet Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
Table 3-14 LD/ST Data Size Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
Table 3-15 P-bits Field Description in Compact Instruction Packet Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
Table 3-16 Available Compact Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-34
Table 4-1 Relationships Between Operands, Operand Size, Functional Units, and Opfields for Example Instruction (ADD) . . . . . 4-3
Table 4-2 Program Counter Values for Branch Using a Displacement Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-59
Table 4-3 Program Counter Values for Branch Using a Register Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-61
Table 4-4 Program Counter Values for B IRP Instruction Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-63
Table 4-5 Program Counter Values for B NRP Instruction Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-65
Table 4-6 Data Types Supported by LDB(U) Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-358
Table 4-7 Data Types Supported by LDB(U) Instruction (15-Bit Offset) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-362
Table 4-8 Data Types Supported by LDH(U) Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-369
Table 4-9 Data Types Supported by LDH(U) Instruction (15-Bit Offset). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-372
Table 4-10 Register Addresses for Accessing the Control Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-472
Table 8-1 SPLOOP Instruction Flow for Example 8-4 and Example 8-5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
Table 8-2 SPLOOPW Instruction Flow for Example 8-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Table 8-3 Software Pipeline Instruction Flow Using the Loop Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Table 8-4 SPLOOPD Minimum Loop Iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-23
Table 8-5 SPLOOP Instruction Flow for First Three Cycles of Example 8-15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-32
Table 8-6 SPLOOP Instruction Flow for Example 8-15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-34
Table A-1 Instruction Compatibility Between C62x, C64x, C64x+, C67x, C67x+, and C674x DSPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
Table B-1 Instruction to Functional Unit Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
Table C-1 Instructions Executing in the .D Functional Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Table C-2 .D Unit Opcode Map Symbol Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Table C-3 Address Generator Options for Load/Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3
Table D-1 Instructions Executing in the .L Functional Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .D-1
Table D-2 .L Unit Opcode Map Symbol Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .D-2
Table E-1 Instructions Executing in the .M Functional Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-1
Table E-2 .M Unit Opcode Map Symbol Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-2
Table F-1 Instructions Executing in the .S Functional Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-1
Table F-2 .S Unit Opcode Map Symbol Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-2
Table G-1 .D, .L, and .S Units Opcode Map Symbol Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .G-1
Table H-1 Instructions Executing With No Unit Specified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .H-2
Table H-2 No Unit Specified Instructions Opcode Map Symbol Definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .H-2
List of Figures
Figure 1-1 QMPY32 - Example of vector instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
Figure 1-2 TMS320C66x DSP Block Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
Figure 2-1 CPU Data Paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
Figure 2-2 Addressing Mode Register (AMR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
Figure 2-3 Control Status Register (CSR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
Figure 2-4 PWRD Field of Control Status Register (CSR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
Figure 2-5 Galois Field Polynomial Generator Function Register (GFPGFR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
Figure 2-6 Interrupt Clear Register (ICR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
Figure 2-7 Interrupt Enable Register (IER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Figure 2-8 Interrupt Flag Register (IFR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Figure 2-9 Interrupt Return Pointer Register (IRP). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
Figure 2-10 Interrupt Set Register (ISR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
Figure 2-11 Interrupt Service Table Pointer Register (ISTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-22
Figure 2-12 NMI Return Pointer Register (NRP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
Figure 2-13 E1 Phase Program Counter (PCE1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
Figure 2-14 DSP Core Number Register (DNUM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
Figure 2-15 Exception Flag Register (EFR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
Figure 2-16 GMPY Polynomial A-Side Register (GPLYA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
Figure 2-17 GMPY Polynomial B-Side (GPLYB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-26
Figure 2-18 Internal Exception Report Register (IERR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
Figure 2-19 Inner Loop Count Register (ILC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-28
Figure 2-20 Interrupt Task State Register (ITSR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-28
Figure 2-21 NMI/Exception Task State Register (NTSR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29
Figure 2-22 Reload Inner Loop Count Register (RILC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-30
Figure 2-23 Saturation Status Register (SSR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-30
Figure 2-24 Time Stamp Counter Register - Low Half (TSCL). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-31
Figure 2-25 Time Stamp Counter Register - High Half (TSCH). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-31
Figure 2-26 Task State Register (TSR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33
Figure 2-27 Floating-Point Adder Configuration Register (FADCR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35
Figure 2-28 Floating-Point Auxiliary Configuration Register (FAUCR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-37
Figure 2-29 Floating-Point Multiplier Configuration Register (FMCR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-40
Figure 3-1 Single-Precision Floating-Point Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Figure 3-2 Double-Precision Floating-Point Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Figure 3-3 Basic Format of a Fetch Packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Figure 3-4 Examples of the Detectability of Write Conflicts by the Assembler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Figure 3-5 CPU Fetch Packet Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
Figure 3-6 Compact Instruction Header Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-30
Figure 3-7 Layout Field in Compact Header Word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-30
Figure 3-8 Expansion Field in Compact Header Word. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
Figure 3-9 P-bits Field in Compact Header Word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
Figure 5-1 Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Figure 5-2 Fetch Phases of the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Figure 5-3 Decode Phases of the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Figure 5-4 Execute Phases of the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Figure 5-5 Pipeline Phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Figure 5-6 Pipeline Operation: One Execute Packet per Fetch Packet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Figure 5-7 Pipeline Phases Block Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Figure 5-8 Single-Cycle Instruction Phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Figure 5-9 Single-Cycle Instruction Execution Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Figure 5-10 Two-Cycle Instruction Phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
Figure 5-11 Single 16 × 16 Multiply Instruction Execution Block Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
Figure 5-12 Store Instruction Phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
List of Examples
Example 2-1 Code to Read the 64-Bit TSC Value in Branch Delay Slot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-32
Example 2-2 Code to Read the 64-Bit TSC Value Using DINT/RINT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-32
Example 3-1 Fully Serial p-Bit Pattern in a Fetch Packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Example 3-2 Fully Parallel p-Bit Pattern in a Fetch Packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Example 3-3 Partially Serial p-Bit Pattern in a Fetch Packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
Example 3-4 LDW Instruction in Circular Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
Example 3-5 ADDAH Instruction in Circular Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-26
Example 3-6 LDNW in Circular Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-27
Example 5-1 Execute Packet in Figure 5-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Example 6-1 Relocation of Interrupt Service Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Example 6-2 Interrupts Versus Writes to GIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Example 6-3 Code Sequence to Disable Maskable Interrupts Globally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
Example 6-4 Code Sequence to Enable Maskable Interrupts Globally. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
Example 6-5 Code Sequence with Disable Global Interrupt Enable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-6 Code Sequence with Restore Global Interrupt Enable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-7 Code Sequence with Disable Reenable Interrupt Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-8 Code Sequence to Enable an Individual Interrupt (INT9) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Example 6-9 Code Sequence to Disable an Individual Interrupt (INT9) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Example 6-10 Code to Set an Individual Interrupt (INT6) and Read the Flag Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
Example 6-11 Code to Clear an Individual Interrupt (INT6) and Read the Flag Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
Example 6-12 Code to Return From NMI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
Example 6-13 Code to Return from a Maskable Interrupt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
Example 6-14 Code Without Single Assignment: Multiple Assignment of A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-26
Example 6-15 Code Using Single Assignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-26
Example 6-16 Manual Interrupt Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-28
Example 6-17 Code Sequence to Invoke a Trap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-28
Example 6-18 Code Sequence for Trap Return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-29
Example 7-1 Code to Return From Exception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Example 7-2 Code to Quickly Detect OS Service Request. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Example 8-1 SPMASK Using Unit Mask to Indicate Masked Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Example 8-2 SPMASK Using Caret to Indicate Masked Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Example 8-3 Copy Loop Coded as C Fragment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Example 8-4 SPLOOP Implementation of Copy Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Example 8-5 SPLOOPD Implementation of Copy Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Example 8-6 Example C Coded Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Example 8-7 SPLOOPW Implementation of C Coded Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Example 8-8 SPLOOP, SPLOOP Body, and SPKERNEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Example 8-9 Using ILC With the SPLOOP Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-23
Example 8-10 Using ILC With a SPLOOPD Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-24
Example 8-11 Using ILC With a SPLOOPD Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
Example 8-12 Using the SPLOOPW Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-30
Example 8-13 strcpy() Using the SPLOOPW Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-30
Example 8-14 Using the SPMASK Instruction to Merge Setup Code with SPLOOPW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-32
Example 8-15 Using the SPMASK Instruction to Merge Reset Code with SPLOOP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-33
Example 8-16 Initiating a Branch Prior to SPLOOP Body . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-39
Preface
Notational Conventions
This document uses the following conventions:
• Commands and keywords are in boldface font.
• Arguments for which you supply values are in italic font.
• Terminal sessions and information the system displays are in screen font.
• Information you must enter is in boldface screen font.
• Elements in square brackets ([ ]) are optional.
The information in a caution or a warning is provided for your protection. Please read
each caution and warning carefully.
Trademarks
VLYNQ, PIQUA Software, wONE, PBCC, Uni-DSL, Dynamic Adaptive Equalization, Telinnovation, TurboDSL Packet
Accelerator, interOps Test Labs, TurboDOX, and INCA are trademarks of Texas Instruments Incorporated.
All other brand names and trademarks mentioned in this document are the property of Texas Instruments
Incorporated or their respective owners, as applicable.
Chapter 1
Introduction
The TMS320C66x is the next-generation fixed- and floating-point DSP. The new DSP
enhances the TMS320C674x, which merged the TMS320C67x+ floating-point and the
TMS320C64x+™ fixed-point instruction set architectures.
This document describes the CPU architecture, pipeline, instruction set, and interrupts
of the C66x DSP. The C66x CorePac is the name used to designate the CPU together
with the hardware providing memory, bandwidth management, interrupt, memory
protection, and power-down support. The C66x CorePac is not described in this
document because it is fully covered in the C66x CorePac User Guide (SPRUGW0).
1.1 DSP Features and Options
The C64x+, C674x, and C66x devices include these additional features:
• Compact instructions: Common instructions (AND, ADD, LD, MPY) have
16-bit versions to reduce code size.
• Protected mode operation: A two-level system of privileged program execution
to support higher-capability operating systems and system features such as
memory protection.
• Exceptions support for error detection and program redirection to provide
robust code execution
• Hardware support for modulo loop operation to reduce code size and allow
interrupts during fully-pipelined code
• Each multiplier can perform 32 × 32 bit multiplies
• Additional instructions to support complex multiplies allowing up to eight 16-bit
multiply/add/subtracts per clock cycle
The C66x has the following key improvements to the ISA:
• 4x Multiply Accumulate improvement for both fixed and floating point
• Improvement of the floating point arithmetic
• Enhancement of the vector processing capability for fixed and floating point
• Addition of domain-specific instructions for complex arithmetic and matrix
operations.
1.1.1 4x Multiply
The new C66x Core ISA significantly improves the maximum number of multiply
operations that can be executed per cycle. The core can now execute up to 32
(16x16-bit) multiplies per cycle or up to 8 single-precision floating-point multiplies per
cycle.
[Figure 1-1 QMPY32 - Example of vector instruction]
For example, the C66x can now perform up to two multiplications of a [1x2] complex
vector by a [2x2] complex matrix per cycle. It also provides a set of instructions that
operate on either scalar or vector numbers to perform complex multiplications,
conjugation of a complex number, multiplication by the conjugate, rotations of complex
numbers, and similar operations.
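As an illustration of the arithmetic involved (not of the C66x instructions themselves), the following C sketch spells out a complex multiply, a conjugate, and one [1x2] by [2x2] complex vector-matrix product. The type cplx32 and the function names are hypothetical, and saturation, rounding, and packed operand formats are deliberately ignored.

```c
#include <stdint.h>

/* Hypothetical complex sample type; not a type defined by this guide. */
typedef struct {
    int32_t re;
    int32_t im;
} cplx32;

/* (a.re + j a.im) * (b.re + j b.im), with no saturation or rounding. */
cplx32 cplx_mul(cplx32 a, cplx32 b)
{
    cplx32 r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}

/* Conjugate of a complex number: the imaginary part changes sign. */
cplx32 cplx_conj(cplx32 a)
{
    a.im = -a.im;
    return a;
}

/* out[c] = v[0] * m[0][c] + v[1] * m[1][c] for c = 0, 1. */
void cvec2_mul_cmat2(const cplx32 v[2], const cplx32 m[2][2], cplx32 out[2])
{
    for (int c = 0; c < 2; ++c) {
        cplx32 p0 = cplx_mul(v[0], m[0][c]);
        cplx32 p1 = cplx_mul(v[1], m[1][c]);
        out[c].re = p0.re + p1.re;
        out[c].im = p0.im + p1.im;
    }
}
```

One such product needs four complex multiplies, that is, sixteen real multiplies, which gives a sense of why the multiplier throughput matters for this class of operation.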
Table 1-1 provides a comparison of the raw performance between the C674x and the
C66x ISA.
Table 1-1 Raw Performance Comparison Between the C674x and the C66x
                                                  C674x           C66x
Fixed point 16x16 MACs per cycle                  8               32
Fixed point 32x32 MACs per cycle                  2               8
Floating point single precision MACs per cycle    2               8
Arithmetic floating point operations per cycle    6 (1)           16 (2)
Load/store width                                  2 x 64-bit      2 x 64-bit
Vector size (SIMD capability)                     32-bit          128-bit
                                                  (2 x 16-bit,    (4 x 32-bit, 4 x 16-bit,
                                                  4 x 8-bit)      4 x 8-bit)
1. One operation per .L, .S, .M units for each side (A and B)
2. 2-way SIMD on .L and .S units (e.g. 8 SP operations for A and B) and 4 SP multiply on one .M unit
(e.g. 8 SP operations for A and B)
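The per-cycle floating-point operation counts in Table 1-1 follow directly from its footnotes. The short C sketch below only reproduces that arithmetic; the function and constant names are illustrative and do not correspond to anything defined by the DSP.

```c
/* Number of data paths (sides A and B). */
enum { SIDES = 2 };

/* C674x: one operation on each of the .L, .S, and .M units, per side. */
int c674x_fp_ops_per_cycle(void)
{
    int units_per_side = 3;            /* .L, .S, .M */
    return units_per_side * SIDES;
}

/* C66x: 2-way SIMD on .L and .S, plus 4 SP multiplies on .M, per side. */
int c66x_fp_ops_per_cycle(void)
{
    int ls_ops_per_side = 2 * 2;       /* two units, two SIMD lanes each */
    int m_ops_per_side = 4;            /* four SP multiplies on one .M unit */
    return (ls_ops_per_side + m_ops_per_side) * SIDES;
}
```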
1.2 DSP Architecture
[Figure 1-2 TMS320C66x DSP Block Diagram: C66x DSP core (instruction fetch, instruction decode, data paths A and B with register files A31-A0 and B31-B0), 32KB L1P, 32KB L1D, L2 cache/SRAM, and the UMC, XMC, and EMC memory controllers]
The program fetch, instruction dispatch, and instruction decode units can deliver up to
fifteen 16- and/or 32-bit instructions to the functional units every CPU clock cycle. The
processing of instructions occurs in each of the two data paths (A and B), each of which
contains four functional units (.L, .S, .M, and .D) and 32 32-bit general-purpose
registers. The data paths are described in more detail in Chapter 2 "CPU Data Paths
and Control" on page 2-1. A control register file provides the means to configure and
control various processor operations. To understand how instructions are fetched,
dispatched, decoded, and executed in the data path, see Chapter 5 "Pipeline" on
page 5-1.
The DSP has a 256-bit read-only port to access internal program memory and two
256-bit ports (read and write) to access internal data memory.
Chapter 2
CPU Data Paths and Control
This chapter focuses on the CPU, providing information about the data paths and
control registers. The two register files and the data cross paths are described.
2.1 Introduction
The components of the data path for the CPU are shown in Figure 2-1 on page 2-3.
These components consist of:
• Two general-purpose register files (A and B)
• Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2)
• Two load-from-memory data paths (LD1 and LD2)
• Two store-to-memory data paths (ST1 and ST2)
• Two data address paths (DA1 and DA2)
• Two register file data cross paths (1X and 2X)
2.2 General-Purpose Register Files
[Figure 2-1 CPU Data Paths. Data path A: register file A (A0-A31) with functional units .L1, .S1, .M1, and .D1 and the ST1, LD1, and DA1 paths. Data path B: register file B (B0-B31) with functional units .L2, .S2, .M2, and .D2 and the ST2, LD2, and DA2 paths. The 1X and 2X cross paths connect the two sides. Note: default bus width is 64 bits (i.e. a register pair).]
The DSP general-purpose register files support data ranging in size from packed 8-bit
data through 128-bit fixed-point data.
Values larger than 32 bits (such as 40-bit values and 64-bit quantities) are stored in
register pairs. Values larger than 64 bits (such as 128-bit quantities) are stored in a pair
of register pairs (a quadruplet of registers).
Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit
register; eight 8-bit values, four 16-bit values, or two 32-bit values in a single 64-bit
register pair; and eight 16-bit values, four 32-bit values, or two 64-bit values in a single
128-bit register quadruplet.
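To make the packed layouts concrete, here is a minimal C sketch that packs four 8-bit values or two 16-bit values into one 32-bit word and extracts individual lanes again. The function names are hypothetical, and the lane numbering (lane 0 in the least significant bits) is an assumption made for this sketch rather than something stated in this section.

```c
#include <stdint.h>

/* Pack four 8-bit values into a 32-bit word; b0 lands in bits 7..0. */
uint32_t pack4x8(uint8_t b3, uint8_t b2, uint8_t b1, uint8_t b0)
{
    return ((uint32_t)b3 << 24) | ((uint32_t)b2 << 16) |
           ((uint32_t)b1 << 8)  |  (uint32_t)b0;
}

/* Pack two 16-bit values into a 32-bit word; lo lands in bits 15..0. */
uint32_t pack2x16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | (uint32_t)lo;
}

/* Extract 8-bit lane i (0..3), where lane 0 is the least significant byte. */
uint8_t lane8(uint32_t w, unsigned i)
{
    return (uint8_t)(w >> (8u * i));
}

/* Extract 16-bit lane i (0..1), where lane 0 is the least significant half. */
uint16_t lane16(uint32_t w, unsigned i)
{
    return (uint16_t)(w >> (16u * i));
}
```

The same idea extends to 64-bit register pairs and 128-bit quadruplets by using wider containers with more lanes.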
Table 2-1 64-bit Register Pairs
Register File
A B
A1:A0 B1:B0
A3:A2 B3:B2
A5:A4 B5:B4
A7:A6 B7:B6
A9:A8 B9:B8
A11:A10 B11:B10
A13:A12 B13:B12
A15:A14 B15:B14
A17:A16 B17:B16
A19:A18 B19:B18
A21:A20 B21:B20
A23:A22 B23:B22
A25:A24 B25:B24
A27:A26 B27:B26
A29:A28 B29:B28
A31:A30 B31:B30
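The pairing rule in Table 2-1 is regular: pair number k on either side is made of registers 2k+1 and 2k. The C sketch below generates the pair names and splits a 64-bit value into two 32-bit halves. The function names are hypothetical, and placing the low 32 bits in the even register is an assumption of this sketch, since this section does not state the ordering.

```c
#include <stdint.h>
#include <stdio.h>

/* Write the name of 64-bit pair k (0..15) on side 'A' or 'B' into buf,
   for example "A1:A0". Returns the name length, or -1 on bad arguments. */
int pair_name(char side, unsigned k, char *buf, size_t len)
{
    if ((side != 'A' && side != 'B') || k > 15u)
        return -1;
    return snprintf(buf, len, "%c%u:%c%u", side, 2u * k + 1u, side, 2u * k);
}

/* Split a 64-bit value into the halves held by the odd and even registers.
   Assumption: the even register holds bits 31..0, the odd one bits 63..32. */
void split64(uint64_t v, uint32_t *odd_hi, uint32_t *even_lo)
{
    *odd_hi = (uint32_t)(v >> 32);
    *even_lo = (uint32_t)(v & 0xFFFFFFFFu);
}
```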
2.3 Functional Units
The .L/.S units now support up to 64-bit operands. The src1 and dst ports are now
64 bits wide; src2 is also 64 bits wide and can directly transport long 40-bit data (the
long src expansion port has been removed).
The .M units now support up to 128-bit source operands. Each source operand can use
up to two 64-bit read ports (src1 and src1_hi for the first operand, src2 and src2_hi for
the second operand). The src2 input of the .M unit is selectable between the cross path
and the same-side register file.
Since the .M unit can return up to 128-bit results, it includes two 64-bit write ports
(dst1 and dst2) to the register file. The dst1 and dst2 ports can be used in combination
to transport a 128-bit result generated by one 4-cycle instruction or to transport two
independent results (one result of 64 bits or less from a 2-cycle instruction and another
result of 64 bits or less from a 4-cycle instruction).
2.4 Register File Cross Paths
On the C66x, the 1X and 2X cross paths allow functional units from one data path to
access a 64-bit operand from the opposite-side register file. On the C64x+ and C674x, the
cross paths are limited to a 32-bit operand.
Therefore, the C66x enables new combinations of operands for existing C64x+/C674x
instructions. The C64x+/C674x instructions that can now use the cross path for
long 40-bit data are:
ABS     .Lx  xlong, slong
ADD     .Lx  scst5, xlong, slong
CMPEQ   .Lx  scst5, xlong, uint
CMPGT   .Lx  scst5, xlong, uint
CMPGTU  .Lx  ucst5, xulong, uint
CMPLT   .Lx  scst5, xlong, uint
CMPLTU  .Lx  ucst5, xlong, uint
NORM    .Lx  xlong, uint
SADD    .Lx  scst5, xlong, slong
SAT     .Lx  xlong, slong
SHL     .Sx  xlong, (uint or ucst5), slong
SHR     .Sx  xlong, (uint or ucst5), slong
SHRU    .Sx  xulong, (uint or ucst5), ulong
SSUB    .Lx  scst5, xlong, slong
SUB     .Lx  scst5, xlong, slong
On the C6000 architecture, some of the ports for long and doubleword operands are
shared between functional units. This places a constraint on which long or doubleword
operations can be scheduled on a data path in the same execute packet. See Section
3.8.7 on page 3-18.
2.6 Data Address Paths
The DA1 and DA2 resources and their associated data paths are specified as T1 and T2,
respectively. T1 consists of the DA1 address path and the LD1 and ST1 data paths. For
the DSP, LD1 is comprised of LD1a and LD1b to support 64-bit loads; ST1 is comprised
of ST1a and ST1b to support 64-bit stores. Similarly, T2 consists of the DA2 address
path and the LD2 and ST2 data paths. For the DSP, LD2 is comprised of LD2a and
LD2b to support 64-bit loads; ST2 is comprised of ST2a and ST2b to support 64-bit
stores.
The T1 and T2 designations appear in the functional unit fields for load and store
instructions. For example, the following load instruction uses the .D1 unit to generate
the address but uses the LD2 path resource from DA2 to place the data in the B
register file. The use of the DA2 resource is indicated with the T2 designation.
LDW .D1T2 *A0[3],B1
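The address generated by the .D1 unit in this instruction can be sketched in C as follows. The sketch assumes that a bracketed offset such as [3] is scaled by the access size (4 bytes for a word load) and that circular addressing is not in effect; neither point is stated in this section, and the function names are hypothetical.

```c
#include <stdint.h>

/* Effective byte address for a base register plus a scaled index.
   Assumption: the index is multiplied by the access size in bytes. */
uint32_t scaled_address(uint32_t base, uint32_t index, uint32_t access_bytes)
{
    return base + index * access_bytes;
}

/* Word (32-bit) load: the access size is 4 bytes. */
uint32_t ldw_address(uint32_t base, uint32_t index)
{
    return scaled_address(base, index, 4u);
}
```

Under these assumptions, with A0 holding 0x1000, the instruction above would read the word at byte address 0x100C and deliver it to B1 over the LD2 path.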
The DSP contains Galois field multiply hardware that is used for Reed-Solomon encode
and decode functions. To understand the relevance of the Galois field multiply
hardware, it is necessary to first define some mathematical terms.
Two kinds of number systems that are common in algorithm development are integers
and real numbers. For integers, addition, subtraction, and multiplication operations
can be performed. Division can also be performed, if a nonzero remainder is allowed.
For real numbers, all four of these operations can be performed, even if there is a
nonzero remainder for division operations.
Real numbers can belong to a mathematical structure called a field. A field consists of
a set of data elements along with addition, subtraction, multiplication, and division. A
field of integers can also be created if modulo arithmetic is performed.
An example is doing arithmetic using integers modulo 2. Perform the operations using
normal integer arithmetic and then take the result modulo 2. Table 2-3 illustrates
addition, subtraction, and multiplication modulo 2.
2.7 Galois Field
Note that addition and subtraction results are the same, and in fact are equivalent to the
XOR (exclusive-OR) operation in binary. Also, the multiplication result is equal to the
AND operation in binary. These properties are unique to modulo 2 arithmetic, but
modulo 2 arithmetic is used extensively in error correction coding. Another more
general property is that division by any nonzero element is now defined. Division can
always be performed, if every element other than zero has a multiplicative inverse:
$x \times x^{-1} = 1$
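The modulo 2 properties described above can be checked with a short Python sketch. This is an illustration only, not part of the original guide; the function names are hypothetical.

```python
def add_mod2(a, b):
    """Ordinary integer addition followed by reduction modulo 2."""
    return (a + b) % 2


def sub_mod2(a, b):
    """Ordinary integer subtraction followed by reduction modulo 2."""
    return (a - b) % 2


def mul_mod2(a, b):
    """Ordinary integer multiplication followed by reduction modulo 2."""
    return (a * b) % 2


def check_mod2_identities():
    """Return True if add/sub match XOR and mul matches AND for all bit pairs."""
    for a in (0, 1):
        for b in (0, 1):
            if add_mod2(a, b) != (a ^ b):
                return False
            if sub_mod2(a, b) != (a ^ b):
                return False
            if mul_mod2(a, b) != (a & b):
                return False
    return True
```

Calling `check_mod2_identities()` exercises all four operand pairs, which is the complete operation table for single-bit operands.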
Another example, arithmetic modulo 5, illustrates this concept more clearly. The
addition, subtraction, and multiplication tables are given in Table 2-4.
Table 2-4 Modulo 5 Arithmetic
Addition          Subtraction       Multiplication
+ | 0 1 2 3 4     - | 0 1 2 3 4     × | 0 1 2 3 4
0 | 0 1 2 3 4     0 | 0 4 3 2 1     0 | 0 0 0 0 0
1 | 1 2 3 4 0     1 | 1 0 4 3 2     1 | 0 1 2 3 4
2 | 2 3 4 0 1     2 | 2 1 0 4 3     2 | 0 2 4 1 3
3 | 3 4 0 1 2     3 | 3 2 1 0 4     3 | 0 3 1 4 2
4 | 4 0 1 2 3     4 | 4 3 2 1 0     4 | 0 4 3 2 1
In the multiplication table, element 1 appears in every nonzero row and
column. Every nonzero element can therefore be multiplied by exactly one element for a
result equal to 1. Therefore, division always works and arithmetic over integers modulo
5 forms a field. Fields generated in this manner are called finite fields or Galois fields
and are written as GF(X), such as GF(2) or GF(5). This construction works only when the
arithmetic performed is modulo a prime number.
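The role of the prime modulus can be illustrated with a brute-force search for multiplicative inverses. The following Python sketch is an illustration under the definitions given above; `inverses_mod` and `is_field` are hypothetical helper names.

```python
def inverses_mod(n):
    """Map each nonzero element of the integers modulo n to its inverse, if one exists."""
    inverses = {}
    for x in range(1, n):
        for y in range(1, n):
            if (x * y) % n == 1:
                inverses[x] = y
                break
    return inverses


def is_field(n):
    """Integers modulo n form a field only if every nonzero element has an inverse."""
    return n > 1 and len(inverses_mod(n)) == n - 1
```

For modulo 5, every nonzero element has an inverse. For modulo 4, element 2 has none, so division by 2 is undefined and the structure is not a field.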
Galois fields can also be formed where the elements are vectors instead of integers if
polynomials are used. Finite fields, therefore, can be found with a number of elements
equal to any power of a prime number. Typically, we are interested in implementing
error correction coding systems using binary arithmetic. All of the fields that are dealt
with in Reed-Solomon coding systems are of the form $GF(2^m)$. This allows performing
addition using XORs on the coefficients of the vectors, and multiplication using a
combination of ANDs and XORs.
A final example considers the field $GF(2^3)$, which has 8 elements. This can be generated
by arithmetic modulo the (irreducible) polynomial $P(x) = x^3 + x + 1$. Elements of this
field look like vectors of three bits. Table 2-5 shows the addition and multiplication
tables for field $GF(2^3)$.
Note that the value 1 (001) appears in every nonzero row of the multiplication table,
which indicates that this is a valid field.
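The multiplication results for this field can be reproduced with a shift-and-XOR routine. The Python sketch below is illustrative and assumes the polynomial x^3 + x + 1 is encoded as the bit pattern 1011; `gf8_add` and `gf8_mul` are hypothetical names.

```python
def gf8_add(a, b):
    """Addition in GF(2^3) is a bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf8_mul(a, b, poly=0b1011, m=3):
    """Multiply two elements of GF(2^m), reducing modulo poly (x^3 + x + 1 by default)."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & (1 << m):
            a ^= poly            # reduce when the degree reaches m
    return result


def every_nonzero_row_contains_one():
    """Check that each nonzero element has some partner whose product is 001."""
    return all(
        any(gf8_mul(a, b) == 1 for b in range(1, 8))
        for a in range(1, 8)
    )
```

For example, `gf8_mul(0b010, 0b100)` evaluates x times x^2, which reduces to x + 1 (011).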
The channel error can now be modeled as a vector of bits, with a one in every bit
position where an error has occurred, and a zero where no error has occurred. Once the
error vector has been determined, it can be subtracted from the received message to
determine the correct code word.
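Because subtraction modulo 2 is the same as XOR, removing the error vector is a single bitwise operation. A minimal Python sketch, with hypothetical function names and an arbitrary 7-bit example word:

```python
def apply_channel_errors(codeword, error_vector):
    """Model the channel: each 1 bit in error_vector flips the matching codeword bit."""
    return codeword ^ error_vector


def remove_errors(received, error_vector):
    """Subtract (XOR) the known error vector to recover the transmitted codeword."""
    return received ^ error_vector
```

Determining the error vector itself is the task of the decoder and is not shown here.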
The Galois field multiply hardware on the DSP is named GMPY4. The GMPY4
instruction performs four parallel operations on 8-bit packed data on the .M unit. The
Galois field multiplier can be programmed to perform all Galois multiplies for fields of
the form $GF(2^m)$, where m can range between 1 and 8 using any generator polynomial.
The field size and the polynomial generator are controlled by the Galois field
polynomial generator function register (GFPGFR).
In addition to the GMPY4 instruction, the C674x DSP has the GMPY instruction that
uses either the GPLYA or GPLYB control register as a source for the polynomial
(depending on whether the A or B side functional unit is used) and produces a 32-bit
result.
The GFPGFR, shown in Figure 2-5 and described in Table 2-10, contains the Galois
field polynomial generator and the field size control bits. These bits control the
operation of the GMPY4 instruction. GFPGFR can only be set via the MVC
instruction. The default function after reset for the GMPY4 instruction is field size =
7h and polynomial = 1Dh.
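The four-lane operation can be modeled in software for the reset configuration. The Python sketch below is not a bit-exact description of GMPY4. It assumes that field size 7h selects GF(2^8) and that polynomial 1Dh stands for x^8 + x^4 + x^3 + x^2 + 1, with the x^8 term implied. Other field sizes are not modeled, and `gf256_mul` and `gmpy4_model` are hypothetical names.

```python
def gf256_mul(a, b, poly_low=0x1D):
    """GF(2^8) multiply; the full generator is assumed to be x^8 plus the bits of poly_low."""
    full_poly = 0x100 | poly_low
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= full_poly
    return result


def gmpy4_model(src1, src2, poly_low=0x1D):
    """Multiply the four packed 8-bit lanes of two 32-bit words independently."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        x = (src1 >> shift) & 0xFF
        y = (src2 >> shift) & 0xFF
        out |= gf256_mul(x, y, poly_low) << shift
    return out
```

Refer to the GMPY4 instruction description for the exact operand handling on the device.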
Multiplication
× 000 001 010 011 100 101 110 111
000 000 000 000 000 000 000 000 000
001 000 001 010 011 100 101 110 111
010 000 010 100 110 011 001 111 101
011 000 011 110 101 111 100 001 010
100 000 100 011 111 110 010 101 001
101 000 101 001 100 010 111 011 110
110 000 110 111 001 101 011 010 100
111 000 111 101 010 001 110 100 011
2.8 Control Register File
Additionally, some of the control register bits are specially accessed in other ways. For
example, arrival of a maskable interrupt on an external interrupt pin, INTm, triggers
the setting of flag bit IFRm. Subsequently, when that interrupt is processed, this triggers
the clearing of IFRm and the clearing of the global interrupt enable bit, GIE. Finally,
when that interrupt processing is complete, the B IRP instruction in the interrupt
service routine restores the pre-interrupt value of the GIE. Similarly, saturating
instructions like SADD set the SAT (saturation) bit in the control status register (CSR).
On the CPU, access to some of the registers is restricted when in User mode. See
Chapter 9 ‘‘CPU Privilege’’ on page 9-1 for more information.
Even though MVC modifies the explicitly named target control register in a single cycle, it can
take extra clocks to complete the modification of a register that is not explicitly named. For
example, MVC cannot modify bits in the IFR directly. Instead, MVC can only write
1s into the ISR or the ICR to specify setting or clearing, respectively, of the IFR bits.
MVC completes this ISR/ICR write in a single (E1) cycle, but the modification of the
IFR bits occurs one clock later. For more information on the manipulation of ISR, ICR,
and IFR, see section 2.8.10 "Interrupt Set Register (ISR)" on page 2-21, section
2.8.6 "Interrupt Clear Register (ICR)" on page 2-18, and section 2.8.8 "Interrupt
Flag Register (IFR)" on page 2-20.
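The effect of ISR and ICR writes on IFR can be summarized in a small behavioral model. The Python sketch below is illustrative. It assumes only bits 15-4 are affected, as suggested by the IS15-IS4 and IC15-IC4 fields in the register figures, and it does not model the one-clock delay. `apply_isr_icr` is a hypothetical name.

```python
WRITABLE_MASK = 0xFFF0  # assumed: IS15-IS4 / IC15-IC4 occupy bits 15-4


def apply_isr_icr(ifr, isr_write=0, icr_write=0):
    """Return the IFR value after writing 1s to ISR and/or ICR.

    A 1 written to ISR sets the matching IFR bit; a 1 written to ICR clears it.
    When the same bit is written in both registers, the ISR write wins.
    """
    set_bits = isr_write & WRITABLE_MASK
    clear_bits = icr_write & WRITABLE_MASK & ~set_bits
    return (ifr | set_bits) & ~clear_bits & 0xFFFFFFFF
```

Writing a 0 to either register leaves the corresponding IFR bit unchanged in this model.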
Saturating instructions, such as SADD, set the saturation flag bit (SAT) in CSR
indirectly. As a result, several of these instructions update the SAT bit one full clock
cycle after their primary results are written to the register file. For example, the SMPY
instruction writes its result at the end of pipeline stage E2; its primary result is available
after one delay slot. In contrast, the SAT bit in CSR is updated one cycle later than the
result is written; this update occurs after two delay slots. (For the specific behavior of
an instruction, refer to the description of that individual instruction.)
The B IRP and B NRP instructions directly update the GIE and NMIE bits,
respectively. Because these branches directly modify CSR and IER, respectively, there
are no delay slots between when the branch is issued and when the control register
updates take effect.
The power-down modes and their wake-up methods are programmed by the PWRD
field (bits 15-10) of CSR. The PWRD field of CSR is shown in. When writing to CSR,
all bits of the PWRD field should be configured at the same time. A logic 0 should be
used when writing to the reserved bit (bit 15) of the PWRD field.
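The requirement to configure all PWRD bits in one write, with bit 15 written as 0, can be expressed as a read-modify-write helper. The Python sketch below is illustrative and operates on a plain integer standing in for the CSR value. `with_pwrd` is a hypothetical name, and the meaning of individual PWRD codes is not modeled.

```python
PWRD_SHIFT = 10
PWRD_MASK = 0x3F << PWRD_SHIFT  # PWRD field, bits 15-10


def with_pwrd(csr, pwrd):
    """Return csr with the whole PWRD field replaced in a single update.

    pwrd is limited to five bits so that the reserved bit (bit 15) is written as 0.
    """
    if not 0 <= pwrd <= 0x1F:
        raise ValueError("bit 15 of CSR is reserved and must be written as 0")
    return (csr & ~PWRD_MASK & 0xFFFFFFFF) | (pwrd << PWRD_SHIFT)
```

On the device, the resulting value would be written back with a single MVC to CSR.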
The PWRD, PCC, DCC, and PGIE fields cannot be written in User mode. The PCC and
DCC fields can only be modified in Supervisor mode. See Chapter 9 ‘‘CPU Privilege’’
on page 9-1 for more information.
CPU ID REVISION ID
R-x1 R-x1
15 10 9 8 7 5 4 2 1 0
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; SW = Writable by the MVC instruction only in supervisor mode;
WC = Bit is cleared on write; -n = value after reset; -x = value is indeterminate after reset
1. See the device-specific data sheet for the default value of this field.
15 8 7 0
Reserved POLY
R-0 R/W-1Dh
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
Table 2-10 Galois Field Polynomial Generator Function Register (GFPGFR) Field Descriptions
Bit Field Value Description
31-27 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
26-24 SIZE 0-7h Field size.
23-8 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
7-0 POLY 0-FFh Polynomial generator.
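The field layout in Table 2-10 maps directly onto shift-and-mask operations. The Python sketch below packs and unpacks a GFPGFR value; `pack_gfpgfr` and `unpack_gfpgfr` are hypothetical helper names.

```python
def pack_gfpgfr(size, poly):
    """Build a GFPGFR value from SIZE (bits 26-24) and POLY (bits 7-0)."""
    if not 0 <= size <= 0x7:
        raise ValueError("SIZE must fit in 3 bits")
    if not 0 <= poly <= 0xFF:
        raise ValueError("POLY must fit in 8 bits")
    return (size << 24) | poly


def unpack_gfpgfr(value):
    """Return (SIZE, POLY) from a 32-bit GFPGFR value; reserved bits are ignored."""
    return (value >> 24) & 0x7, value & 0xFF
```

With the reset values given in the text (field size 7h, polynomial 1Dh), the packed word is 0700001Dh.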
Note: Any write to ICR (by the MVC instruction) effectively has one delay slot
because the results cannot be read (by the MVC instruction) in IFR until two
cycles after the write to ICR.
If the same bit is written simultaneously in ICR and in the interrupt set
register (ISR), the write to ICR is ignored.
Bits 31-16  Reserved      R-0
Bits 15-4   IC15 to IC4   W-0
Bits 3-0    Reserved      R-0
LEGEND: R = Read only; W = Writable by the MVC instruction; -n = value after reset
The IER is not accessible in User mode. See 9.2.4.1 on page 9-2 for more information.
See Chapter 6 ‘‘Interrupts’’ on page 6-1 for more information on interrupts.
Bits 31-16  Reserved      R-0
Bits 15-4   IE15 to IE4   R/W-0
Bits 3-2    Reserved      R-0
Bit 1       NMIE          R/W-0
Bit 0       1             R-1
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
Bits 31-16  Reserved      R-0
Bits 15-4   IF15 to IF4   R-0
Bits 3-2    Reserved      R-0
Bit 1       NMIF          R-0
Bit 0       0             R-0
LEGEND: R = Readable by the MVC instruction; -n = value after reset
The IRP contains the 32-bit address of the first execute packet in the program flow that
was not executed because of a maskable interrupt. Although you can write a value to
IRP, any subsequent interrupt processing may overwrite that value.
IRP
R/W-x
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -x = value is indeterminate after reset
Note: Any write to ISR (by the MVC instruction) effectively has one delay slot
because the results cannot be read (by the MVC instruction) in IFR until two
cycles after the write to ISR.
If the same bit is written simultaneously in ISR and in the interrupt clear
register (ICR), the write to ICR is ignored.
Bits 31-16  Reserved      R-0
Bits 15-4   IS15 to IS4   W-0
Bits 3-0    Reserved      R-0
LEGEND: R = Read only; W = Writable by the MVC instruction; -n = value after reset
The ISTP is not accessible in User mode. See 9.2.4.1 on page 9-2 for more information.
See Chapter 6 ‘‘Interrupts’’ on page 6-1 for more information on interrupts.
Figure 2-11 Interrupt Service Table Pointer Register (ISTP)
31 16
ISTB
R/W-S
15 10 9 5 4 3 2 1 0
ISTB HPEINT 0 0 0 0 0
Table 2-15 Interrupt Service Table Pointer Register (ISTP) Field Descriptions
Bit Field Value Description
31-10 ISTB 0-3FFFFFh Interrupt service table base portion of the IST address. This field is cleared to a device-specific default value on
reset; therefore, upon startup the IST must reside at this specific address. See the device-specific data manual for
more information. After reset, you can relocate the IST by writing a new value to ISTB. If relocated, the first ISFP
(corresponding to RESET) is never executed via interrupt processing, because reset clears the ISTB to its default
value. See Example 6-1 on page 6-8.
9-5 HPEINT 0-1Fh Highest priority enabled interrupt that is currently pending. This field indicates the number (related bit position
in the IFR) of the highest priority interrupt (as defined in Table 6-1 on page 6-3) that is enabled by its bit in the
IER. Thus, the ISTP can be used for manual branches to the highest priority enabled interrupt. If no interrupt is
pending and enabled, HPEINT contains the value 0. The corresponding interrupt need not be enabled by NMIE
(unless it is NMI) or by GIE.
4-0 0 0 Cleared to 0 (fetch packets must be aligned on 8-word (32-byte) boundaries).
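The field layout in Table 2-15 means that the ISTP value is itself a 32-byte-aligned address inside the interrupt service table. The Python sketch below decodes the fields and forms that address. The function names are hypothetical, and the interpretation of the address as the target of a manual branch follows the HPEINT description above.

```python
def decode_istp(istp):
    """Split a 32-bit ISTP value into (ISTB, HPEINT)."""
    istb = (istp >> 10) & 0x3FFFFF   # bits 31-10
    hpeint = (istp >> 5) & 0x1F      # bits 9-5
    return istb, hpeint


def isfp_address(istp):
    """Address formed by ISTB and HPEINT with bits 4-0 forced to 0."""
    return istp & 0xFFFFFFE0


def make_istp(istb, hpeint):
    """Compose an ISTP value from its fields (for experimentation only)."""
    return ((istb & 0x3FFFFF) << 10) | ((hpeint & 0x1F) << 5)
```

Consecutive HPEINT values are 32 bytes apart, which matches the 8-word fetch packet alignment noted for bits 4-0.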
The NRP contains the 32-bit address of the first execute packet in the program flow that
was not executed because of a nonmaskable interrupt. Although you can write a value
to NRP, any subsequent interrupt processing may overwrite that value.
See Chapter 6 ‘‘Interrupts’’ on page 6-1 for more information on interrupts. See
Chapter 7 ‘‘CPU Exceptions’’ on page 7-1 for more information on exceptions.
Figure 2-12 NMI Return Pointer Register (NRP)
31 0
NRP
R/W-x
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -x = value is indeterminate after reset
PCE1
R-x
2.9 Control Register File Extensions
R-0 R-S
LEGEND: R = Readable by the MVC instruction; -n = value after reset; S = See the device-specific data manual for the default value of this field after reset
The ECR is not accessible in User mode. See 9.2.4.1 ‘‘Restricted Control Register
Access in User Mode’’ on page 9-2 for more information. See Chapter 7 ‘‘CPU
Exceptions’’ on page 7-1 for more information on exceptions.
The EFR is not accessible in User mode. See 9.2.4.1 on page 9-2 for more information.
See Chapter 7 ‘‘CPU Exceptions’’ on page 7-1 for more information on exceptions.
15 2 1 0
LEGEND: R = Readable by the MVC EFR instruction only in Supervisor mode; W = Clearable by the MVC ECR instruction only in Supervisor mode; -n =
value after reset
32-bit polynomial
R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
32-bit polynomial
R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
The IERR is not accessible in User mode. See 9.2.4.1 on page 9-2 for more
information. See Chapter 7 ‘‘CPU Exceptions’’ on page 7-1 for more information on
exceptions.
Reserved MSX LBX PRX RAX RCX OPX EPX FPX IFX
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction only in Supervisor mode; W = Writable by the MVC instruction only in Supervisor mode; -n = value after
reset
Table 2-18 Internal Exception Report Register (IERR) Field Descriptions (Part 1 of 2)
Bit Field Value Description
31-9 Reserved 0 Reserved. Read as 0.
8 MSX Missed stall exception
0 Missed stall exception is not the cause.
1 Missed stall exception is the cause.
7 LBX SPLOOP buffer exception
0 SPLOOP buffer exception is not the cause.
1 SPLOOP buffer exception is the cause.
6 PRX Privilege exception
0 Privilege exception is not the cause.
1 Privilege exception is the cause.
5 RAX Resource access exception
0 Resource access exception is not the cause.
1 Resource access exception is the cause.
4 RCX Resource conflict exception
0 Resource conflict exception is not the cause.
1 Resource conflict exception is the cause.
3 OPX Opcode exception
0 Opcode exception is not the cause.
1 Opcode exception is the cause.
2 EPX Execute packet exception
0 Execute packet exception is not the cause.
1 Execute packet exception is the cause.
1 FPX Fetch packet exception
0 Fetch packet exception is not the cause.
1 Fetch packet exception is the cause.
Table 2-18 Internal Exception Report Register (IERR) Field Descriptions (Part 2 of 2)
Bit Field Value Description
0 IFX Instruction fetch exception
0 Instruction fetch exception is not the cause.
1 Instruction fetch exception is the cause.
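The bit assignments in Table 2-18 can be turned into a lookup that lists which internal exception causes are flagged. The Python sketch below is illustrative; `decode_ierr` is a hypothetical name and the input is a plain integer standing in for a value read from IERR.

```python
IERR_BITS = {
    0: "IFX",  # instruction fetch exception
    1: "FPX",  # fetch packet exception
    2: "EPX",  # execute packet exception
    3: "OPX",  # opcode exception
    4: "RCX",  # resource conflict exception
    5: "RAX",  # resource access exception
    6: "PRX",  # privilege exception
    7: "LBX",  # SPLOOP buffer exception
    8: "MSX",  # missed stall exception
}


def decode_ierr(value):
    """Return the names of all cause bits set in value, lowest bit first."""
    return [name for bit, name in sorted(IERR_BITS.items()) if value & (1 << bit)]
```

Reserved bits 31-9 are ignored by this helper.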
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
The GIE bit in ITSR is physically the same bit as the PGIE bit in CSR.
The ITSR is not accessible in User mode. See 9.2.4.1 on page 9-2 for more
information.
R-0
15 14 13 11 10 9 8 7 6 5 4 3 2 1 0
IB SPLX Reserved EXC INT Rsvd CXM Reserved XEN GEE SGIE GIE
R/W-0 R/W-0 R-0 R/W-0 R/W-0 R-0 R/W-0 R-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction only in Supervisor mode; W = Writable by the MVC instruction only in Supervisor mode; -n = value after
reset
The NTSR is not accessible in User mode. See 9.2.4.1 on page 9-2 for more
information.
Reserved HWE
R-0 R/W-0
15 14 13 11 10 9 8 7 6 5 4 3 2 1 0
IB SPLX Reserved EXC INT Rsvd CXM Reserved XEN GEE SGIE GIE
R/W-0 R/W-0 R-0 R/W-0 R/W-0 R-0 R/W-0 R-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction only in Supervisor mode; W = Writable by the MVC instruction only in Supervisor mode; -n = value after
reset
R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
Instructions resulting in saturation set the appropriate unit flag in SSR in the cycle
following the writing of the result to the register file. The setting of the flag from
a functional unit takes precedence over a write to the bit from an MVC instruction. If
no functional unit saturation has occurred, the flags may be set to 0 or 1 by the MVC
instruction, unlike the SAT bit in CSR.
The bits in SSR can be set by the MVC instruction or by a saturation in the associated
functional unit. The bits are cleared only by a reset or by the MVC instruction. The bits
are not cleared by the occurrence of a nonsaturating instruction.
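The sticky behavior and the precedence rule can be captured in a one-step model. The Python sketch below is illustrative. It assumes six flags in bits 5-0, which is inferred from the field order in the register figure and not stated in the text. `ssr_next` is a hypothetical name, and the one-cycle delay is not modeled.

```python
SSR_FLAG_MASK = 0x3F  # assumed: six unit flags in bits 5-0


def ssr_next(ssr, mvc_write=None, unit_saturated=0):
    """Return the next SSR value.

    mvc_write:      value written by MVC in this step, or None for no write.
    unit_saturated: mask of flags whose functional unit saturated in this step.
    A flag set by a saturating unit overrides whatever MVC wrote to that bit.
    """
    value = ssr if mvc_write is None else mvc_write
    return (value | unit_saturated) & SSR_FLAG_MASK
```

With no MVC write and no saturation, the flags keep their previous values, which reflects that nonsaturating instructions do not clear them.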
Reserved M2 M1 S2 S1 L2 L1
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
2.9.13.1 Initialization
The counter is cleared to 0 after reset, and counting is disabled.
When reading the full 64-bit value, it must be ensured that no interrupts are serviced
between the two MVC instructions if an ISR is allowed to make use of the time stamp
counter. There is no way for an ISR to restore the previous value of TSCH (snapshot)
if it reads TSCL, since a new snapshot is performed.
Two methods for reading the 64-bit count value in an uninterruptible manner are
shown in Example 2-1 and Example 2-2. Example 2-1 uses the fact that interrupts are
automatically disabled in the delay slots of a branch to prevent an interrupt from
happening between the TSCL read and the TSCH read. Example 2-2 accomplishes the
same task by explicitly disabling interrupts.
Example 2-1 Code to Read the 64-Bit TSC Value in Branch Delay Slot
BNOP TSC_Read_Done, 3
MVC TSCL,A0 ; Read the low half first; high half copied to TSCH
MVC TSCH,A1 ; Read the snapshot of the high half
TSC_Read_Done:
Example 2-2 Code to Read the 64-Bit TSC Value Using DINT/RINT
DINT
|| MVC TSCL,A0 ; Read the low half first; high half copied to TSCH
RINT
|| MVC TSCH,A1 ; Read the snapshot of the high half
TSC_Read_Done:
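The reason the two reads must not be separated by an interrupt can be seen in a small behavioral model of the snapshot mechanism. The Python sketch below is illustrative; the class and method names are hypothetical and only the snapshot-on-TSCL-read behavior described above is modeled.

```python
class TimeStampCounter:
    """Behavioral model: reading TSCL copies the upper 32 bits into a TSCH snapshot."""

    def __init__(self):
        self.count = 0
        self.tsch_snapshot = 0

    def tick(self, cycles=1):
        """Advance the free-running 64-bit counter."""
        self.count = (self.count + cycles) & 0xFFFFFFFFFFFFFFFF

    def read_tscl(self):
        """Return the low half and take a new snapshot of the high half."""
        self.tsch_snapshot = (self.count >> 32) & 0xFFFFFFFF
        return self.count & 0xFFFFFFFF

    def read_tsch(self):
        """Return the high half captured by the most recent TSCL read."""
        return self.tsch_snapshot


def read_tsc64(tsc):
    """Read low then high with nothing in between, as Examples 2-1 and 2-2 ensure."""
    low = tsc.read_tscl()
    high = tsc.read_tsch()
    return (high << 32) | low
```

If an interrupt service routine calls `read_tscl()` between the two reads, the snapshot is replaced and the interrupted code pairs its low half with the wrong high half.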
The GIE and SGIE bits may be written in both User mode and Supervisor mode. The
remaining bits all have restrictions on how they are written. See 9.2.4.2 ‘‘Partially
Restricted Control Register Access in User Mode’’ on page 9-3 for more information.
The GIE bit in TSR is physically the same bit as the GIE bit in CSR. It is retained in CSR
for compatibility reasons, but placed in TSR so that it will be copied in the event of
either an exception or an interrupt.
Reserved
R-0
15 14 13 11 10 9 8 7 6 5 4 3 2 1 0
IB SPLX Reserved EXC INT Rsvd CXM Reserved XEN GEE SGIE GIE
R-0 R-0 R-0 R/C-0 R-0 R-0 R/W-0 R-0 R/W-0 R/S-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writable in Supervisor mode; C = Clearable in Supervisor mode; S = Can be set in Supervisor mode;
-n = value after reset
2.10 Control Register File Extensions for Floating-Point Operations
The OVER, UNDER, INEX, INVAL, DENn, NANn, INFO, UNORD and DIV0 bits
within these registers will not be modified by a conditional instruction whose condition
is false.
FMCR specifies the desired floating-point rounding mode and contains the warning
bits for the instructions that use the .M units.
FAUCR specifies the desired floating-point rounding mode and contains the warning
bits for the instructions that use the .S units.
FADCR specifies the desired floating-point rounding mode and contains the warning
bits for the instructions that use the .L units and for instructions that can be executed
on both .L and .S units. The warning bits in FADCR are the logical-OR of the warnings
produced on the .L functional unit and the warnings produced by the instructions that
can be executed on both .L and .S units.
Therefore the following instructions executing in the .S functional unit use the
rounding mode from and set the warning bits in FADCR (and not in FAUCR as other
.S unit instructions do):
• ADDSP / SUBSP
• ADDDP / SUBDP
• FADDSP / FSUBSP
• FADDDP / FSUBDP
• DINTHSP
• DINTHSPU
• DSPINTH
• DSPINT
• INTSPU
• INTSP
• SPINT
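The rule for which register supplies the rounding mode and collects the warning bits can be written as a lookup. The Python sketch below is illustrative. `warning_register` is a hypothetical name, and SUBSP is included as the single-precision partner of ADDSP, which is an assumption.

```python
# Instructions that can execute on both .L and .S units (from the list above).
SHARED_L_S = {
    "ADDSP", "SUBSP", "ADDDP", "SUBDP",
    "FADDSP", "FSUBSP", "FADDDP", "FSUBDP",
    "DINTHSP", "DINTHSPU", "DSPINTH", "DSPINT",
    "INTSPU", "INTSP", "SPINT",
}


def warning_register(mnemonic, unit):
    """Return the configuration register used by mnemonic when run on unit (e.g. ".S1")."""
    family = unit[:2].upper()
    if family == ".M":
        return "FMCR"
    if family == ".L":
        return "FADCR"
    if family == ".S":
        return "FADCR" if mnemonic.upper() in SHARED_L_S else "FAUCR"
    raise ValueError("no floating-point configuration register for unit " + unit)
```

For example, ADDSP uses FADCR even when it is scheduled on .S1, while an instruction that executes only on the .S units uses FAUCR.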
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
15 11 10 9 8 7 6 5 4 3 2 1 0
Reserved RMODE UNDER INEX OVER INFO INVAL DEN2 DEN1 NAN2 NAN1
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
Table 2-23 Floating-Point Adder Configuration Register (FADCR) Field Descriptions (Part 1 of 3)
Bit Field Value Description
31-27 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
Table 2-23 Floating-Point Adder Configuration Register (FADCR) Field Descriptions (Part 2 of 3)
Bit Field Value Description
26-25 RMODE 0-3h Rounding mode select for .L2.
0 Round toward nearest representable floating-point number
1h Round toward 0 (truncate)
2h Round toward infinity (round up)
3h Round toward negative infinity (round down)
24 UNDER Result underflow status for .L2.
0 Result does not underflow.
1 Result underflows.
23 INEX Inexact results status for .L2.
0
1 Result differs from what would have been computed had the exponent range and precision been unbounded;
never set with INVAL.
22 OVER Result overflow status for .L2.
0 Result does not overflow.
1 Result overflows.
21 INFO Signed infinity for .L2.
0 Result is not signed infinity.
1 Result is signed infinity.
20 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when infinity is
subtracted from infinity.
19 DEN2 Denormalized number select for .L2 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
18 DEN1 Denormalized number select for .L2 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
17 NAN2 NaN select for .L2 src2.
0 src2 is not NaN.
1 src2 is NaN.
16 NAN1 NaN select for .L2 src1.
0 src1 is not NaN.
1 src1 is NaN.
15-11 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
10-9 RMODE 0-3h Rounding mode select for .L1.
0 Round toward nearest representable floating-point number
1h Round toward 0 (truncate)
2h Round toward infinity (round up)
3h Round toward negative infinity (round down)
8 UNDER Result underflow status for .L1.
0 Result does not underflow.
1 Result underflows.
Table 2-23 Floating-Point Adder Configuration Register (FADCR) Field Descriptions (Part 3 of 3)
Bit Field Value Description
7 INEX Inexact results status for .L1.
0
1 Result differs from what would have been computed had the exponent range and precision been unbounded;
never set with INVAL.
6 OVER Result overflow status for .L1.
0 Result does not overflow.
1 Result overflows.
5 INFO Signed infinity for .L1.
0 Result is not signed infinity.
1 Result is signed infinity.
4 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when infinity is
subtracted from infinity.
3 DEN2 Denormalized number select for .L1 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
2 DEN1 Denormalized number select for .L1 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
1 NAN2 NaN select for .L1 src2.
0 src2 is not NaN.
1 src2 is NaN.
0 NAN1 NaN select for .L1 src1.
0 src1 is not NaN.
1 src1 is NaN.
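The FADCR layout in Table 2-23 repeats the same 11-bit pattern in each halfword, with the lower half for .L1 and the upper half for .L2. The Python sketch below decodes a register value into the rounding mode and the list of set status bits for each unit. The function names and the rounding-mode strings are illustrative.

```python
# Status flags in bits 0-8 of each halfword, lowest bit first (Table 2-23).
FLAG_NAMES = ["NAN1", "NAN2", "DEN1", "DEN2", "INVAL", "INFO", "OVER", "INEX", "UNDER"]

RMODES = {
    0: "nearest",
    1: "toward zero",
    2: "toward +infinity",
    3: "toward -infinity",
}


def decode_fadcr_half(half):
    """Decode one 16-bit half: RMODE in bits 10-9, status flags in bits 8-0."""
    flags = [name for i, name in enumerate(FLAG_NAMES) if half & (1 << i)]
    rmode = RMODES[(half >> 9) & 0x3]
    return rmode, flags


def decode_fadcr(value):
    """Return the decoded settings for .L1 (bits 15-0) and .L2 (bits 31-16)."""
    return {
        ".L1": decode_fadcr_half(value & 0xFFFF),
        ".L2": decode_fadcr_half((value >> 16) & 0xFFFF),
    }
```

Reserved bits 15-11 and 31-27 are ignored by the decoder.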
Reserved DIV0 UNORD UND INEX OVER INFO INVAL DEN2 DEN1 NAN2 NAN1
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
15 11 10 9 8 7 6 5 4 3 2 1 0
Reserved DIV0 UNORD UND INEX OVER INFO INVAL DEN2 DEN1 NAN2 NAN1
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
Table 2-24 Floating-Point Auxiliary Configuration Register (FAUCR) Field Descriptions (Part 1 of 2)
Bit Field Value Description
31-27 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
26 DIV0 Source to reciprocal operation for .S2.
0 0 is not source to reciprocal operation.
1 0 is source to reciprocal operation.
25 UNORD Source to a compare operation for .S2
0 NaN is not a source to a compare operation.
1 NaN is a source to a compare operation.
24 UND Result underflow status for .S2.
0 Result does not underflow.
1 Result underflows.
23 INEX Inexact results status for .S2.
0
1 Result differs from what would have been computed had the exponent range and precision been
unbounded; never set with INVAL.
22 OVER Result overflow status for .S2.
0 Result does not overflow.
1 Result overflows.
21 INFO Signed infinity for .S2.
0 Result is not signed infinity.
1 Result is signed infinity.
20 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when infinity is
subtracted from infinity.
19 DEN2 Denormalized number select for .S2 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
18 DEN1 Denormalized number select for .S2 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
17 NAN2 NaN select for .S2 src2.
0 src2 is not NaN.
1 src2 is NaN.
16 NAN1 NaN select for .S2 src1.
0 src1 is not NaN.
1 src1 is NaN.
15-11 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
10 DIV0 Source to reciprocal operation for .S1.
0 0 is not source to reciprocal operation.
1 0 is source to reciprocal operation.
9 UNORD Source to a compare operation for .S1
0 NaN is not a source to a compare operation.
1 NaN is a source to a compare operation.
Table 2-24 Floating-Point Auxiliary Configuration Register (FAUCR) Field Descriptions (Part 2 of 2)
Bit Field Value Description
8 UND Result underflow status for .S1.
0 Result does not underflow.
1 Result underflows.
7 INEX Inexact results status for .S1.
0
1 Result differs from what would have been computed had the exponent range and precision been
unbounded; never set with INVAL.
6 OVER Result overflow status for .S1.
0 Result does not overflow.
1 Result overflows.
5 INFO Signed infinity for .S1.
0 Result is not signed infinity.
1 Result is signed infinity.
4 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when infinity is subtracted from infinity.
3 DEN2 Denormalized number select for .S1 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
2 DEN1 Denormalized number select for .S1 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
1 NAN2 NaN select for .S1 src2.
0 src2 is not NaN.
1 src2 is NaN.
0 NAN1 NaN select for .S1 src1.
0 src1 is not NaN.
1 src1 is NaN.
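As an illustration of how the FAUCR fields in Table 2-24 map onto bit positions, the following Python sketch decodes a 32-bit FAUCR value into the flags that are set for each .S unit. It is a host-side model only: the function name `decode_faucr` and the sample value are hypothetical, and the value is assumed to have already been copied out of the control register file with MVC.

```python
# Bit offsets of each flag inside one 16-bit half of FAUCR (Table 2-24).
# .S1 uses bits 10-0; .S2 uses the same layout shifted up by 16 (bits 26-16).
FAUCR_FLAGS = {
    "NAN1": 0, "NAN2": 1, "DEN1": 2, "DEN2": 3, "INVAL": 4, "INFO": 5,
    "OVER": 6, "INEX": 7, "UND": 8, "UNORD": 9, "DIV0": 10,
}


def decode_faucr(value):
    """Return the names of the flags that are set, grouped by functional unit."""
    result = {}
    for unit, base in ((".S1", 0), (".S2", 16)):
        result[unit] = [name for name, bit in FAUCR_FLAGS.items()
                        if (value >> (base + bit)) & 1]
    return result


# Hypothetical sample: bit 26 (DIV0 for .S2), bit 7 (INEX for .S1), bit 0 (NAN1 for .S1).
flags = decode_faucr(0x04000081)
```

The reserved fields (bits 31-27 and 15-11) are ignored by the sketch, which matches their always-read-as-0 description.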
Floating-Point Multiplier Configuration Register (FMCR)
31-27     26-25  24     23     22     21     20     19     18     17     16
Reserved  RMODE  UNDER  INEX   OVER   INFO   INVAL  DEN2   DEN1   NAN2   NAN1
R-0       R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0
15-11     10-9   8      7      6      5      4      3      2      1      0
Reserved  RMODE  UNDER  INEX   OVER   INFO   INVAL  DEN2   DEN1   NAN2   NAN1
R-0       R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0  R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writable by the MVC instruction; -n = value after reset
Table 2-25 Floating-Point Multiplier Configuration Register (FMCR) Field Descriptions (Part 1 of 2)
Bit Field Value Description
31-27 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
26-25 RMODE 0-3h Rounding mode select for .M2.
0 Round toward nearest representable floating-point number
1h Round toward 0 (truncate)
2h Round toward infinity (round up)
3h Round toward negative infinity (round down)
24 UNDER Result underflow status for .M2.
0 Result does not underflow.
1 Result underflows.
23 INEX Inexact results status for .M2.
0
1 Result differs from what would have been computed had the exponent range and precision been unbounded;
never set with INVAL.
22 OVER Result overflow status for .M2.
0 Result does not overflow.
1 Result overflows.
21 INFO Signed infinity for .M2.
0 Result is not signed infinity.
1 Result is signed infinity.
20 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when infinity is subtracted from infinity.
19 DEN2 Denormalized number select for .M2 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
18 DEN1 Denormalized number select for .M2 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
Table 2-25 Floating-Point Multiplier Configuration Register (FMCR) Field Descriptions (Part 2 of 2)
Bit Field Value Description
17 NAN2 NaN select for .M2 src2.
0 src2 is not NaN.
1 src2 is NaN.
16 NAN1 NaN select for .M2 src1.
0 src1 is not NaN.
1 src1 is NaN.
15-11 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
10-9 RMODE 0-3h Rounding mode select for .M1.
0 Round toward nearest representable floating-point number
1h Round toward 0 (truncate)
2h Round toward infinity (round up)
3h Round toward negative infinity (round down)
8 UNDER Result underflow status for .M1.
0 Result does not underflow.
1 Result underflows.
7 INEX Inexact results status for .M1.
0
1 Result differs from what would have been computed had the exponent range and precision been unbounded;
never set with INVAL.
6 OVER Result overflow status for .M1.
0 Result does not overflow.
1 Result overflows.
5 INFO Signed infinity for .M1.
0 Result is not signed infinity.
1 Result is signed infinity.
4 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when infinity is subtracted from infinity.
3 DEN2 Denormalized number select for .M1 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
2 DEN1 Denormalized number select for .M1 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
1 NAN2 NaN select for .M1 src2.
0 src2 is not NaN.
1 src2 is NaN.
0 NAN1 NaN select for .M1 src1.
0 src1 is not NaN.
1 src1 is NaN.
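The RMODE encoding in Table 2-25 can be sketched as a pair of helper functions that read and update the two-bit rounding-mode field for either multiplier. This is a host-side model in Python; the names `get_rmode` and `set_rmode` are hypothetical and are not part of any TI tool or library.

```python
# Rounding-mode encodings from Table 2-25.
ROUNDING_MODES = {
    0: "nearest",
    1: "toward zero (truncate)",
    2: "toward +infinity (round up)",
    3: "toward -infinity (round down)",
}

# Position of the least-significant RMODE bit for each unit (bits 10-9 and 26-25).
RMODE_SHIFT = {".M1": 9, ".M2": 25}


def get_rmode(fmcr, unit):
    """Return the rounding mode description selected for the given .M unit."""
    return ROUNDING_MODES[(fmcr >> RMODE_SHIFT[unit]) & 0x3]


def set_rmode(fmcr, unit, mode):
    """Return a new FMCR value with the unit's RMODE field replaced by `mode`."""
    if mode not in ROUNDING_MODES:
        raise ValueError("mode must be in the range 0-3")
    shift = RMODE_SHIFT[unit]
    cleared = fmcr & ~(0x3 << shift) & 0xFFFFFFFF
    return cleared | (mode << shift)
```

Because the status flags share the register with RMODE, the sketch follows a read-modify-write pattern so that only the two RMODE bits change.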
Chapter 3
Instruction Set
This chapter describes the assembly language instructions of the TMS320C66x DSP.
Also described are parallel operations, conditional operations, resource constraints,
and addressing modes.
The C66x DSP uses all of the instructions available to the TMS320C62x, TMS320C64x,
TMS320C64x+, and TMS320C674x DSPs. The C64x DSP instructions include 8-bit
and 16-bit extensions, nonaligned word loads and stores, and data packing/unpacking
operations.
3.1 Instruction Operation and Execution Notation
3.2 Instruction Syntax and Opcode Notations
p parallel execution; 0 = next instruction is not executed in parallel, 1 = next instruction is executed in parallel
ptr offset from either A4-A7 or B4-B7 depending on the value of the s bit. The ptr field is the 2 least-significant bits of the src2 (baseR)
field—bit 2 of register address is forced to 1.
r LDDW/LDNDW/LDNW instruction
rsv reserved
s side A or B for destination; 0 = side A, 1 = side B.
sc scaling mode; 0 = nonscaled, offsetR/ucst5 is not shifted; 1 = scaled, offsetR/ucst5 is shifted
scstn n-bit signed constant field
sn sign
src source
src1 source 1
src2 source 2
sz data size select; 0 = primary size, 1 = secondary size (see ‘‘Expansion Field in Compact Header Word’’ on page 3-31)
t side of source/destination (src/dst) register; 0 = side A, 1 = side B
ucstn n-bit unsigned constant field
3.3 Overview of IEEE Standard Single- and Double-Precision Formats
Instructions that use DP sources fall in two categories: instructions that read the upper
and lower 32-bit words on separate cycles, and instructions that read both 32-bit words
on the same cycle. All instructions that produce a double-precision result write the low
32-bit word one cycle before writing the high 32-bit word. If an instruction that writes
a DP result is followed by an instruction that uses the result as its DP source and it reads
the upper and lower words on separate cycles, then the second instruction can be
scheduled on the same cycle that the high 32-bit word of the result is written. The lower
result is written on the previous cycle. This is because the second instruction reads the
low word of the DP source one cycle before the high word of the DP source.
Normal single-precision values are always accurate to at least six decimal places,
sometimes up to nine decimal places. Normal double-precision values are always
accurate to at least 15 decimal places, sometimes up to 17 decimal places.
Single-precision format: s (bit 31), e (bits 30-23), f (bits 22-0)
LEGEND: s = sign bit (0 = positive, 1 = negative); e = 8-bit exponent (0 < e < 255);
f = 23-bit fraction ($0 < f < 1 \times 2^{-1} + 1 \times 2^{-2} + \dots + 1 \times 2^{-23}$, or $0 < f < (2^{23} - 1)/2^{23}$)
Table 3-4 shows the s, e, and f values for special single-precision floating-point
numbers.
Table 3-4 Special Single-Precision Values
Symbol Sign (s) Exponent (e) Fraction (f)
+0 0 0 0
-0 1 0 0
+Inf 0 255 0
-Inf 1 255 0
NaN x 255 nonzero
QNaN x 255 1xx..x
SNaN x 255 0xx..x and nonzero
Table 3-5 shows hexadecimal and decimal values for some single-precision
floating-point numbers.
Table 3-5 Hexadecimal and Decimal Representation for Selected Single-Precision Values
Symbol Hex Value Decimal Value
NaN_out 7FFFFFFF QNaN
0 00000000 0.0
-0 80000000 -0.0
1 3F800000 1.0
2 40000000 2.0
LFPN 7F7FFFFF 3.40282347e+38
SFPN 00800000 1.17549435e-38
LDFPN 007FFFFF 1.17549421e-38
SDFPN 00000001 1.40129846e-45
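The special-value rules in Table 3-4 and the sample encodings in Table 3-5 can be checked with a short Python sketch. The function names `classify` and `sp_value` are hypothetical; the field widths default to single precision (8 exponent bits, 23 fraction bits), and the same split applies to double precision with 11 exponent bits and 52 fraction bits.

```python
import struct


def classify(bits, exp_bits=8, frac_bits=23):
    """Classify a raw IEEE bit pattern using the s/e/f rules for special values."""
    s = bits >> (exp_bits + frac_bits)
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    f = bits & ((1 << frac_bits) - 1)
    e_max = (1 << exp_bits) - 1          # 255 for SP, 2047 for DP
    sign = "-" if s else "+"
    if e == e_max:
        if f == 0:
            return sign + "Inf"
        # The most-significant fraction bit separates QNaN (1xx..x) from SNaN (0xx..x).
        return "QNaN" if f >> (frac_bits - 1) else "SNaN"
    if e == 0:
        return sign + "0" if f == 0 else "denormal"
    return "normal"


def sp_value(bits):
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

For instance, `classify(0x7FFFFFFF)` reports a QNaN, which agrees with the NaN_out row, and `sp_value(0x3F800000)` evaluates to 1.0.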
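Double-precision format in a register pair:
Odd register: s (bit 31), e (bits 30-20), upper 20 bits of f (bits 19-0)
Even register: lower 32 bits of f (bits 31-0)
LEGEND: s = sign bit (0 = positive, 1 = negative); e = 11-bit exponent (0 < e < 2047);
f = 52-bit fraction ($0 < f < 1 \times 2^{-1} + 1 \times 2^{-2} + \dots + 1 \times 2^{-52}$, or $0 < f < (2^{52} - 1)/2^{52}$)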
Table 3-6 shows the s, e, and f values for special double-precision floating-point
numbers.
Table 3-6 Special Double-Precision Values
Symbol Sign (s) Exponent (e) Fraction (f)
+0 0 0 0
-0 1 0 0
+Inf 0 2047 0
-Inf 1 2047 0
NaN x 2047 nonzero
QNaN x 2047 1xx..x
SNaN x 2047 0xx..x and nonzero
Table 3-7 shows hexadecimal and decimal values for some double-precision
floating-point numbers.
Table 3-7 Hexadecimal and Decimal Representation for Selected Double-Precision Values
Symbol Hex Value Decimal Value
NaN_out 7FFFFFFFFFFFFFFF QNaN
0 0000000000000000 0.0
-0 8000000000000000 -0.0
1 3FF0000000000000 1.0
2 4000000000000000 2.0
LFPN 7FEFFFFFFFFFFFFF 1.7976931348623157e+308
SFPN 0010000000000000 2.2250738585072014e-308
LDFPN 000FFFFFFFFFFFFF 2.2250738585072009e-308
SDFPN 0000000000000001 4.9406564584124654e-324
3.4 Delay Slots
The number of delay slots is equivalent to the number of additional cycles required
after the source operands are read for the result to be available for reading.
The functional unit latency is equivalent to the number of cycles that must pass before
the functional unit can start executing the next instruction.
The C66x is fully binary compatible with the C64x+/C674x. Therefore, the number of
delay slots and the latencies of all C64x+/C674x instructions are unchanged on the
C66x.
All new floating-point instructions have a functional unit latency of one cycle (they can be
fully pipelined). The existing floating-point instructions have also been improved, but to
maintain full backward binary compatibility, the improved versions were given new
opcodes:
• FMPYDP is the optimized version of MPYDP
• FADDSP/FSUBSP are the optimized versions of ADDSP/SUBSP
• FADDDP/FSUBDP are the optimized versions of ADDDP/SUBDP
Table 3-8 shows the number of delay slots associated with each type of instruction.
Table 3-8 Delay Slot and Functional Unit Latency
Instruction Type  Delay Slots  Functional Unit Latency  Read Cycles (1)     Write Cycles  Branch Taken
NOP               0            1
Store             0            1                        i                   i
Load              4            1                        i                   i, i+4 (2)
Branch            5            1                        i (3)                             i+5
Single cycle      0            1                        i                   i
2-cycle           1            1                        i                   i+1
3-cycle           2            1                        i                   i+2
4-cycle           3            1                        i                   i+3
DP compare        1            2                        i, i+1              i+1
2-cycle DP        1            1                        i                   i+1
INTDP             4            1                        i                   i+3, i+4
MPYSP2DP          4            2                        i                   i+3, i+4
ADDDP/SUBDP       6            2                        i, i+1              i+5, i+6
MPYSPDP           6            3                        i, i+1              i+5, i+6
MPYI              8            4                        i, i+1, i+2, i+3    i+8
MPYID             9            4                        i, i+1, i+2, i+3    i+8, i+9
MPYDP             9            4                        i, i+1, i+2, i+3    i+8, i+9
(1) Cycle i is in the E1 pipeline phase.
(2) For loads, any address modification happens in cycle i. The loaded data is written into the register file in cycle i+4.
(3) The branch to label, branch to IRP, and branch to NRP instructions do not read any general-purpose registers.
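The two definitions above (delay slots and functional unit latency) translate into simple cycle arithmetic. The following Python sketch is a simplified scheduling model, not a description of the assembler: the function names are hypothetical, only a few rows of Table 3-8 are copied in, and the model ignores the special case of double-precision sources that are read over two cycles.

```python
# (delay slots, functional unit latency) for a few instruction types from Table 3-8.
TIMING = {
    "single-cycle": (0, 1),
    "2-cycle": (1, 1),
    "4-cycle": (3, 1),
    "load": (4, 1),
    "ADDDP/SUBDP": (6, 2),
    "MPYI": (8, 4),
}


def last_write_cycle(kind, issue_cycle):
    """Cycle on which the final part of the result is written."""
    delay_slots, _ = TIMING[kind]
    return issue_cycle + delay_slots


def earliest_consumer_cycle(kind, issue_cycle):
    """First cycle on which a dependent instruction can read the complete result."""
    return last_write_cycle(kind, issue_cycle) + 1


def next_issue_on_unit(kind, issue_cycle):
    """First cycle on which the same functional unit can start another instruction."""
    _, latency = TIMING[kind]
    return issue_cycle + latency
```

Under this model, a load issued on cycle 0 has its data written on cycle 4, so the first instruction that can use the loaded value issues on cycle 5, while the unit itself is free again on cycle 1.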
3.5 Parallel Operations
The CPU supports compact 16-bit instructions. Unlike the normal 32-bit instructions,
the p-bit information for compact instructions is not contained within the instruction
opcode. Instead, the p-bit is contained within the p-bits field within the fetch packet
header. See Section 3.10 on page 3-29 for more information.
On the CPU, the execute packet can cross fetch packet boundaries, but will be limited
to no more than eight instructions in a fetch packet. The last instruction in an execute
packet will be marked with its p-bit cleared to zero. There are three types of p-bit
patterns for fetch packets. These three p-bit patterns result in the following execution
sequences for the eight instructions:
• Fully serial
• Fully parallel
• Partially serial
Example 3-1 through Example 3-3 show the conversion of a p-bit sequence into a
cycle-by-cycle execution stream of instructions.
instruction A
instruction B
instruction C
|| instruction D
|| instruction E
instruction F
|| instruction G
|| instruction H
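The grouping in the listing above follows from the p-bit rule: a p-bit of 1 chains an instruction to the next one, and a p-bit of 0 closes the execute packet. A minimal Python sketch of that conversion is shown below. The function name `execute_packets` is hypothetical, and the p-bit pattern 0, 0, 1, 1, 0, 1, 1, 0 is inferred from the listing rather than quoted from it.

```python
def execute_packets(instructions, p_bits):
    """Group instructions into execute packets according to their p-bits."""
    packets, current = [], []
    for name, p in zip(instructions, p_bits):
        current.append(name)
        if p == 0:               # p = 0: the next instruction starts a new packet
            packets.append(current)
            current = []
    if current:                  # p = 1 on the last instruction: packet is still open
        packets.append(current)
    return packets


# Partially serial pattern matching the listing: A, B, then C||D||E, then F||G||H.
partially_serial = execute_packets("ABCDEFGH", [0, 0, 1, 1, 0, 1, 1, 0])
```

A fully serial fetch packet has all p-bits cleared and yields eight single-instruction packets; a fully parallel one has the first seven p-bits set and yields a single packet of eight instructions.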
3.6 Conditional Operations
Compact (16-bit) instructions on the DSP do not contain a creg field and always
execute unconditionally. See ‘‘Compact Instructions on the CPU’’ on page 3-29 for
more information.
Table 3-9 Registers That Can Be Tested by Conditional Operations
Specified Conditional Register  creg (bits 31 30 29)  z (bit 28)
Unconditional                   0 0 0                 0
Reserved                        0 0 0                 1
B0                              0 0 1                 z
B1                              0 1 0                 z
B2                              0 1 1                 z
A1                              1 0 0                 z
A2                              1 0 1                 z
A0                              1 1 0                 z
Reserved                        1 1 x (1)             x (1)
(1) x can be any value.
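The creg and z encodings in Table 3-9 can be decoded from the top four bits of a 32-bit opcode, as in the Python sketch below. The function name `decode_condition` is hypothetical. The sketch also assumes the usual C6000 convention, which is not restated in the table, that z = 1 tests the register for zero (written [!reg]) and z = 0 tests it for nonzero (written [reg]).

```python
# creg encodings (opcode bits 31-29) from Table 3-9.
CREG_REGISTERS = {
    0b001: "B0", 0b010: "B1", 0b011: "B2",
    0b100: "A1", 0b101: "A2", 0b110: "A0",
}


def decode_condition(opcode):
    """Return the predicate implied by bits 31-28 of a 32-bit opcode."""
    creg = (opcode >> 29) & 0x7
    z = (opcode >> 28) & 0x1
    if creg == 0b000:
        return "unconditional" if z == 0 else "reserved"
    if creg == 0b111:
        return "reserved"
    reg = CREG_REGISTERS[creg]
    # Assumed convention: z = 1 means "execute if the register is zero".
    return f"[!{reg}]" if z else f"[{reg}]"
```

For example, an opcode whose top nibble is 3h (creg = 001, z = 1) decodes to [!B0].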
The above instructions are mutually exclusive; only one will execute. If they are
scheduled in parallel, mutually exclusive instructions are constrained as described in
Section 3.8. If mutually exclusive instructions share any resources as described in
Section 3.8, they cannot be scheduled in parallel (put in the same execute packet), even
though only one will execute.
The act of making an instruction conditional is often called predication and the
conditional register is often called the predication register.
3.7 SPMASKed Operations
See Chapter 8 ‘‘Software Pipelined Loop (SPLOOP) Buffer’’ on page 8-1 for more
information.
3.8.2 Constraints on the Same Functional Unit Writing in the Same Instruction Cycle
On the C64x+/C674x, the .M unit has two 32-bit write ports, so the results of a 4-cycle
32-bit instruction and a 2-cycle 32-bit instruction operating on the same .M unit can
be written on the same instruction cycle. Any other combination of parallel
writes (for example, a 2-cycle instruction writing a 32-bit result and a 4-cycle instruction
writing a 64-bit result) on the .M unit results in a conflict. On the C674x DSP, this results
in an exception.
On the C66x, the .M unit has two 64-bit write ports to the register file, so the results of
a 4-cycle instruction and a 2-cycle instruction operating on the same .M unit can be
written on the same instruction cycle even when the 4-cycle instruction writes a
64-bit result (CMPY, for example).
However, a 4-cycle instruction that writes a 128-bit result (CMATMPY, for
example) cannot write its result in the same instruction cycle as a 2-cycle instruction
operating on the same .M unit. On the C66x DSP, this results in an exception and
erroneous values being written to the destination registers.
For example, the following sequence is valid and results in A3:A2 and A5 being written
by the .M1 unit on the same cycle.
CMPY.M1 A0,A1,A3:A2 ; This instruction has 3 delay slots
; and generates a 64 bit result
NOP
AVG2 .M1 A4,A5 ; This instruction has 1 delay slot
NOP ; A3:A2 and A5 get written on this cycle
3.8 Resource Constraints
The following sequence is invalid. The attempt to write 160 bits of output through
128 bits of write port fails, and an exception is generated.
CMATMPY .M1 A3:A2, A7:A6:A5:A4, A11:A10:A9:A8 ; This instruction has 3 delay slots
                                               ; but writes a 128-bit result
NOP
MPY .M1 A1,A2,A3 ; This instruction has 1 delay slot
NOP
Even though the .L/.S units can also execute 4-cycle and 2-cycle instructions, two
independent writes from the same .L or .S unit to the register file on the same
instruction cycle are not supported and result in an exception and erroneous values
being written to the destination registers.
Therefore, the following sequence is invalid because A5 and A4 are written by the .L1 unit
on the same cycle.
INTSPU .L1 A1, A5 ; this instruction has three delay slots
 ; (4-cycle instruction)
NOP
DSPINTH .L1 A3:A2, A4 ; this instruction has one delay slot
 ; (2-cycle instruction)
NOP
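The three sequences in this section can be checked with a small model of the write-port rules. The Python sketch below is a deliberate simplification of the C66x behavior described in the text: it assumes that a .M unit accepts two writes on one cycle only if each result is at most 64 bits wide, and that any other functional unit accepts a single write per cycle. The function name `write_conflicts` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def write_conflicts(schedule):
    """Return (unit, cycle) pairs where same-unit writes collide.

    schedule: list of (issue_cycle, unit, delay_slots, result_bits).
    """
    writes = defaultdict(list)
    for issue, unit, delay, bits in schedule:
        writes[(unit, issue + delay)].append(bits)

    conflicts = []
    for (unit, cycle), sizes in sorted(writes.items()):
        if len(sizes) < 2:
            continue
        # Assumed C66x rule: only .M units have two write ports, 64 bits each.
        allowed = unit.startswith(".M") and len(sizes) == 2 and max(sizes) <= 64
        if not allowed:
            conflicts.append((unit, cycle))
    return conflicts


# CMPY (64-bit result) followed two cycles later by AVG2 on .M1: valid.
valid = write_conflicts([(0, ".M1", 3, 64), (2, ".M1", 1, 32)])
# CMATMPY (128-bit result) followed two cycles later by MPY on .M1: conflict on cycle 3.
invalid = write_conflicts([(0, ".M1", 3, 128), (2, ".M1", 1, 32)])
```

The model only tracks the cycle on which a result is written; it does not cover the other resource constraints in Section 3.8.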
For example the following sequence is invalid on C64x+/C674x and valid on the C66x:
ADD .L1 A0, B0, A1
|| ADD .S1 A2, B0, A2
|| ADD .D1 A3, B0, A3
|| MPY .M1 A4, B0, A4
It is possible to avoid the cross path stall by scheduling an instruction that reads an
operand via the cross path at least one cycle after the operand is updated. With
appropriate scheduling, the DSP can provide one cross path operand per data path per
cycle with no stalls. In many cases, the TMS320C6000 Optimizing Compiler and
Assembly Optimizer automatically perform this scheduling.
The DA1 and DA2 resources and their associated data paths are specified as T1 and T2,
respectively. T1 consists of the DA1 address path and the LD1 and ST1 data paths. LD1
is comprised of LD1a and LD1b to support 64-bit loads; ST1 is comprised of ST1a and
ST1b to support 64-bit stores. Similarly, T2 consists of the DA2 address path and the
LD2 and ST2 data paths. LD2 is comprised of LD2a and LD2b to support 64-bit loads;
ST2 is comprised of ST2a and ST2b to support 64-bit stores. The T1 and T2
designations appear in the functional unit fields for load and store instructions.
The DSP can access words and doublewords at any byte boundary using nonaligned
loads and stores. As a result, word and doubleword data does not need alignment to
32-bit or 64-bit boundaries. No other memory access may be used in parallel with a
nonaligned memory access. The other .D unit can be used in parallel, as long as it is not
performing a memory access.
The C674x CPU maintains separate data paths to each functional unit, so these
constraints are removed.
The following execute packet is invalid on the C64x+/C674x and valid on the
C66x:
MPY .M1 A1, A1, A4 ; five reads of register A1
|| ADD .L1 A1, A1, A5
|| SUB .D1 A1, A2, A3
Figure 3-4 shows different multiple-write conflicts. For example, ADD and SUB in
execute packet L1 write to the same register. This conflict is easily detectable.
conflict because they are mutually exclusive. In contrast, because the instructions in L5
may or may not be mutually exclusive, the assembler cannot determine a conflict. If the
pipeline does receive commands to perform multiple writes to the same register, the
result is undefined.
Figure 3-4 Examples of the Detectability of Write Conflicts by the Assembler
L1:        ADD .L2 B5, B6, B7 ; \ detectable, conflict
||         SUB .S2 B8, B9, B7 ; /
L2:        MPY .M2 B0, B1, B2 ; \ not detectable
L3:        ADD .L2 B3, B4, B2 ; /
L4: [!B0]  ADD .L2 B5, B6, B7 ; \ detectable, no conflict
||  [B0]   SUB .S2 B8, B9, B7 ; /
L5: [!B1]  ADD .L2 B5, B6, B7 ; \ not detectable
||  [B0]   SUB .S2 B8, B9, B7 ; /
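The difference between packets L4 and L5 can be expressed as a static test on the two predicates. The Python sketch below is one way such a check might look; it is not a description of how the TI assembler is implemented, and the function names are hypothetical. Two predicated instructions are treated as provably mutually exclusive only when they test the same register with opposite senses.

```python
def parse_predicate(text):
    """'[!B0]' -> ('B0', True); '[B0]' -> ('B0', False); '' -> None (unconditional)."""
    text = text.strip()
    if not text:
        return None
    body = text.strip("[]")
    return (body.lstrip("!"), body.startswith("!"))


def provably_exclusive(pred_a, pred_b):
    """True only if at most one of the two predicates can hold on any cycle."""
    a, b = parse_predicate(pred_a), parse_predicate(pred_b)
    if a is None or b is None:
        return False                      # an unconditional instruction always executes
    return a[0] == b[0] and a[1] != b[1]  # same register, opposite sense


l4 = provably_exclusive("[!B0]", "[B0]")  # L4: no conflict can occur
l5 = provably_exclusive("[!B1]", "[B0]")  # L5: depends on run-time values
```

For L5 the predicates test different registers, so whether both writes to B7 occur depends on run-time register contents that a static check cannot see.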
To determine if all the memory transactions are completed, the MFENCE instruction
checks an internal busy flag. MFENCE always waits at least 5 clock cycles before
checking the busy flag in order to account for pipeline delays.
During the course of executing an MFENCE operation, any enabled interrupts will still
be serviced.
When an interrupt occurs during the execution of an MFENCE instruction, the address
of the execute packet containing the MFENCE instruction is saved in IRP or NRP. This
forces a return to the MFENCE instruction after interrupt servicing.
See Chapter 8 ‘‘Software Pipelined Loop (SPLOOP) Buffer’’ on page 8-1 for more
information.
3.8.12.5 IDLE
An IDLE instruction cannot be placed in parallel with the following instructions:
• DINT
• NOP n (if n > 1)
• RINT
• SPKERNEL(R)
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
3.8.12.6 NOP n
A NOP n (with n > 1) instruction cannot be placed in parallel with other instructions
that carry multicycle NOP counts (ADDKPC, BNOP, CALLP), with the exception of another
NOP n where the NOP count is the same. A NOP n (with n > 1) instruction cannot be placed in
parallel with the following instructions:
• DINT
• IDLE
• RINT
• SPKERNEL(R)
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
3.8.12.7 RINT
A RINT instruction cannot be placed in parallel with the following instructions:
• MVC reg, TSR
• MVC reg, CSR
• B IRP
• B NRP
• DINT
• IDLE
• NOP n (if n > 1)
• SPKERNEL(R)
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
3.8.12.8 SPKERNEL(R)
An SPKERNEL(R) instruction cannot be placed in parallel with the following
instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
3.8.12.9 SPLOOP(D/W)
An SPLOOP(D/W) instruction cannot be placed in parallel with the following
instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPKERNEL(R)
• SPMASK(R)
• SWE
• SWENR
3.8.12.10 SPMASK(R)
An SPMASK(R) instruction cannot be placed in parallel with the following
instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPKERNEL(R)
• SWE
• SWENR
3.8.12.11 SWE
An SWE instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPKERNEL(R)
• SWENR
3.8.12.12 SWENR
An SWENR instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPKERNEL(R)
• SWE
An instruction of the following types scheduled on cycle I has the following constraints:
DP compare No other instruction can use the functional unit on cycles I and I + 1.
ADDDP/SUBDP No other instruction can use the functional unit on cycles I and I + 1.
MPYI No other instruction can use the functional unit on cycles I, I + 1, I + 2, and I + 3.
MPYID No other instruction can use the functional unit on cycles I, I + 1, I + 2, and I + 3.
MPYDP No other instruction can use the functional unit on cycles I, I + 1, I + 2, and I + 3.
If a cross path is used to read a source in an instruction with a multicycle functional unit
latency, you must ensure that no other instruction executing on the same side uses the
cross path.
An instruction of the following types scheduled on cycle I using a cross path to read a
source has the following constraints:
DP compare No other instruction on the same side can use the cross path on cycles I and I + 1.
ADDDP/SUBDP No other instruction on the same side can use the cross path on cycles I and I + 1.
MPYI No other instruction on the same side can use the cross path on cycles I, I + 1, I + 2, and I + 3.
MPYID No other instruction on the same side can use the cross path on cycles I, I + 1, I + 2, and I + 3.
MPYDP No other instruction on the same side can use the cross path on cycles I, I + 1, I + 2, and I + 3.
Other hazards exist because instructions have varying numbers of delay slots and need
the functional unit read and write ports for varying numbers of cycles. A read or write
hazard exists when two instructions on the same functional unit attempt to read or
write, respectively, to the register file on the same cycle.
An instruction of the following types scheduled on cycle I has the following constraints:
2-cycle DP
    A single-cycle instruction cannot be scheduled on that functional unit on cycle I + 1 due to a write hazard on cycle I + 1.
    Another 2-cycle DP instruction cannot be scheduled on that functional unit on cycle I + 1 due to a write hazard on cycle I + 1.
4-cycle
    A single-cycle instruction cannot be scheduled on that functional unit on cycle I + 3 due to a write hazard on cycle I + 3.
    A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle I + 2 due to a write hazard on cycle I + 3.
INTDP
    A single-cycle instruction cannot be scheduled on that functional unit on cycle I + 3 or I + 4 due to a write hazard on cycle I + 3 or I + 4, respectively.
    An INTDP instruction cannot be scheduled on that functional unit on cycle I + 1 due to a write hazard on cycle I + 1.
    A 4-cycle instruction cannot be scheduled on that functional unit on cycle I + 1 due to a write hazard on cycle I + 1.
MPYI
    A 4-cycle instruction cannot be scheduled on that functional unit on cycle I + 4, I + 5, or I + 6.
    A MPYDP instruction cannot be scheduled on that functional unit on cycle I + 4, I + 5, or I + 6.
    A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle I + 6 due to a write hazard on cycle I + 7.
MPYID
    A 4-cycle instruction cannot be scheduled on that functional unit on cycle I + 4, I + 5, or I + 6.
    A MPYDP instruction cannot be scheduled on that functional unit on cycle I + 4, I + 5, or I + 6.
    A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle I + 7 or I + 8 due to a write hazard on cycle I + 8 or I + 9, respectively.
MPYDP
    A 4-cycle instruction cannot be scheduled on that functional unit on cycle I + 4, I + 5, or I + 6.
    A MPYI instruction cannot be scheduled on that functional unit on cycle I + 4, I + 5, or I + 6.
    A MPYID instruction cannot be scheduled on that functional unit on cycle I + 4, I + 5, or I + 6.
    A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle I + 7 or I + 8 due to a write hazard on cycle I + 8 or I + 9, respectively.
ADDDP/SUBDP
    A single-cycle instruction cannot be scheduled on that functional unit on cycle I + 5 or I + 6 due to a write hazard on cycle I + 5 or I + 6, respectively.
    A 4-cycle instruction cannot be scheduled on that functional unit on cycle I + 2 or I + 3 due to a write hazard on cycle I + 5 or I + 6, respectively.
    An INTDP instruction cannot be scheduled on that functional unit on cycle I + 2 or I + 3 due to a write hazard on cycle I + 5 or I + 6, respectively.
All of the previous cases deal with double-precision floating-point instructions or the
MPYI or MPYID instructions, except for the 4-cycle case. The 4-cycle category includes
both single- and double-precision floating-point instructions. Therefore, the 4-cycle
case is important for the following single-precision floating-point instructions:
• ADDSP
• SUBSP
• SPINT
• SPTRUNC
• INTSP
• MPYSP
3.9 Addressing Modes
All registers can perform linear addressing. Only eight registers can perform circular
addressing: A4-A7 are used by the .D1 unit, and B4-B7 are used by the .D2 unit. No
other units can perform circular addressing. LDB(U)/LDH(U)/LDW,
STB/STH/STW, LDNDW, LDNW, STNDW, STNW, LDDW, STDW,
ADDAB/ADDAH/ADDAW/ADDAD, and SUBAB/SUBAH/SUBAW instructions
all use AMR to determine what type of address calculations are performed for these
registers. There is no SUBAD instruction.
For the preincrement, predecrement, positive offset, and negative offset address
generation options, the result of the calculation is the address to be accessed in
memory. For postincrement or postdecrement addressing, the value of baseR before
the addition or subtraction is the address to be accessed from memory.
The circular buffer size in AMR is not scaled; for example, a block-size of 8 is 8 bytes,
not 8 times the data size (byte, halfword, word). So, to perform circular addressing on
an array of 8 words, a size of 32 should be specified, or N = 4. Example 3-4 shows an
LDW performed with register A4 in circular mode and BK0 = 4, so the buffer size is 32
bytes, 16 halfwords, or 8 words. The value in AMR for this example is 00040001h.
mem 104h 1234 5678h mem 104h 1234 5678h mem 104h 1234 5678h
1. 9h words is 24h bytes. 24h bytes is 4 bytes beyond the 32-byte (20h) boundary 100h-11Fh; thus, it is wrapped
around to (124h - 20h = 104h).
The circular buffer size in AMR is not scaled; for example, a block size of 8 is 8 bytes,
not 8 times the data size (byte, halfword, word). So, to perform circular addressing on
an array of 8 words, a size of 32 should be specified, or N = 4. Example 3-5 shows an
ADDAH performed with register A4 in circular mode and BK0 = 4, so the buffer size
is 32 bytes, 16 halfwords, or 8 words. The value in AMR for this example is 00040001h.
A4 00000100h A4 00000106h
1. 13h halfwords is 26h bytes. 26h bytes is 6 bytes beyond the 32-byte (20h) boundary 100h-11Fh; thus, it is wrapped
around to (126h - 20h = 106h).
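The AMR value quoted for these examples can be rebuilt from its fields. The sketch below assumes the field positions given in the AMR description referenced elsewhere in this guide (2-bit mode fields for A4-A7 and B4-B7 in bits 15-0, BK0 in bits 20-16, BK1 in bits 25-21); the helper `make_amr` and the constant names are hypothetical.

```python
# Assumed AMR layout: mode fields in bits 15-0, BK0 in bits 20-16, BK1 in bits 25-21.
MODE_SHIFT = {"A4": 0, "A5": 2, "A6": 4, "A7": 6,
              "B4": 8, "B5": 10, "B6": 12, "B7": 14}
LINEAR, CIRC_BK0, CIRC_BK1 = 0b00, 0b01, 0b10


def make_amr(modes, bk0=0, bk1=0):
    """Build an AMR value from per-register modes and the two block-size fields."""
    amr = 0
    for reg, mode in modes.items():
        amr |= (mode & 0b11) << MODE_SHIFT[reg]
    amr |= (bk0 & 0x1F) << 16
    amr |= (bk1 & 0x1F) << 21
    return amr


amr = make_amr({"A4": CIRC_BK0}, bk0=4)
print(f"{amr:08X}h")  # 00040001h
```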
On the CPU, the circular buffer size must be at least 32 bytes. Nonaligned access to
circular buffers that are smaller than 32 bytes will cause undefined results.
Consider, for example, a circular buffer size of 16 bytes. A circular buffer of this size at
location 20h would look like this in physical memory:
1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3
7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8
x x x x x x x x x a b c d e f g h i j k l m n o p x x x x x x x x x
The effect of circular buffering is to make it so that memory accesses and address
updates in the 20h-2Fh range stay completely inside this range. Effectively, the memory
map behaves in this manner:
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8
h i j k l m n o p a b c d e f g h i j k l m n o p a b c d e f g h i
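The effective memory map above can be regenerated by folding every address into the 16-byte buffer. This is only an illustration of the aliasing picture, not a model of the hardware; the names `BUF_START`, `BUF_SIZE`, and `effective_byte` are hypothetical.

```python
import string

BUF_START, BUF_SIZE = 0x20, 16
# Physical contents of the buffer: bytes a..p at 20h..2Fh.
physical = {BUF_START + i: string.ascii_lowercase[i] for i in range(BUF_SIZE)}


def effective_byte(addr):
    """Return the byte seen at addr when the low 4 address bits wrap in the buffer."""
    wrapped = (BUF_START & ~(BUF_SIZE - 1)) | (addr & (BUF_SIZE - 1))
    return physical[wrapped]


row = " ".join(effective_byte(a) for a in range(0x17, 0x39))
print(row)
```

The printed row matches the third line of the map for addresses 17h through 38h.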
Example 3-6 shows an LDNW performed with register A4 in circular mode and BK0 =
4, so the buffer size is 32 bytes, 16 halfwords, or 8 words. The value in AMR for this
example is 0004 0001h. The buffer starts at address 0020h and ends at 0040h. The
register A4 is initialized to the address 003Ah.
LDNW.D1 *++A4[2],A1
mem 0022h 5678 9ABCh mem 0022h 5678 9ABCh mem 0022h 5678 9ABCh
1. 2h words is 8h bytes. 003Ah + 8h = 0042h, which is 2 bytes beyond the end of the 32-byte (20h) buffer that
starts at 0020h; thus, it is wrapped around to 0022h (0042h - 20h = 0022h).
Table 3-11 describes the addressing generator options. The memory address is formed
from a base address register (baseR) and an optional offset that is either a register
(offsetR) or a 5-bit unsigned constant (ucst5).
Table 3-10 Indirect Address Generation for Load/Store
Addressing Type | No Modification of Address Register | Preincrement or Predecrement of Address Register | Postincrement or Postdecrement of Address Register
Register indirect | *R | *++R, *--R | *R++, *R--
Register relative | *+R[ucst5], *-R[ucst5] | *++R[ucst5], *--R[ucst5] | *R++[ucst5], *R--[ucst5]
Register relative with 15-bit constant offset | *+B14/B15[ucst15] | not supported | not supported
Base + index | *+R[offsetR], *-R[offsetR] | *++R[offsetR], *--R[offsetR] | *R++[offsetR], *R--[offsetR]
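The three columns of the address generation table differ only in which value is used as the memory address and whether baseR is updated. A minimal sketch in linear mode, assuming the offset is scaled by the access size in bytes; the function `generate_address` and its mode names are hypothetical.

```python
def generate_address(mode, base, offset, size=1):
    """Return (address accessed, baseR afterwards) for one load/store.

    mode: "offset" for *+R[..] / *-R[..], "pre" for *++R[..] / *--R[..],
          "post" for *R++[..] / *R--[..]. Use a negative offset for the
          minus forms. size is the access size in bytes.
    """
    delta = offset * size
    if mode == "offset":
        return base + delta, base          # address computed, baseR unchanged
    if mode == "pre":
        return base + delta, base + delta  # baseR updated, new value accessed
    if mode == "post":
        return base, base + delta          # old baseR accessed, then updated
    raise ValueError(f"unknown mode: {mode}")


print(generate_address("pre", 0x100, 2, size=4))   # (264, 264)
print(generate_address("post", 0x100, 2, size=4))  # (256, 264)
```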
3.10 Compact Instructions on the CPU
Within the other seven words of the fetch packet, each word may be composed of a
single 32-bit opcode or two 16-bit opcodes. The header word specifies which words
contain compact opcodes and which contain 32-bit opcodes.
The compiler will automatically code instructions as 16-bit compact instructions when
possible.
Compact header word: bits 31-28 = 1 1 1 0 | bits 27-21 = Layout (7 bits) | bits 20-14 = Expansion (7 bits) | bits 13-0 = p-bits (14 bits)
Bits 27-21 (Layout field) indicate which words in the fetch packet contain 32-bit
opcodes and which words contain two 16-bit opcodes.
Bits 20-14 (Expansion field) contain information that contributes to the decoding of
all compact instructions in the fetch packet.
Bits 13-0 (p-bits field) specify which compact instructions are run in parallel.
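The three fields can be extracted from a header word with shifts and masks. The sketch below assumes the word is available as an unsigned 32-bit integer and that bits 31-28 hold the 1 1 1 0 pattern shown in the header figure; `decode_compact_header` is a hypothetical helper name.

```python
def decode_compact_header(word):
    """Split a compact fetch packet header word into its three fields."""
    if (word >> 28) & 0xF != 0b1110:
        raise ValueError("bits 31-28 are not 1110; not a compact header word")
    return {
        "layout": (word >> 21) & 0x7F,     # bits 27-21
        "expansion": (word >> 14) & 0x7F,  # bits 20-14
        "p_bits": word & 0x3FFF,           # bits 13-0
    }


header = (0b1110 << 28) | (0b1010101 << 21) | (0b0100001 << 14) | 0x1234
print(decode_compact_header(header))
```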
Figure 3-7 shows the layout field in the compact header word and Table 3-12 describes
the bits.
Figure 3-7 Layout Field in Compact Header Word
27 26 25 24 23 22 21
L7 L6 L5 L4 L3 L2 L1
Figure 3-8 shows the expansion field in the compact header word and Table 3-13
describes the bits.
Figure 3-8 Expansion Field in Compact Header Word
20 19 18 16 15 14
PROT RS DSZ BR SAT
Bit 20 (PROT) selects between protected and nonprotected mode for all LD
instructions within the fetch packet. When PROT is 1, four cycles of NOP are added
after each LD instruction within the fetch packet whether the LD is in 16-bit compact
format or 32-bit format.
Bit 19 (RS) specifies which register set is used by compact instructions within the fetch
packet. The register set defines which subset of 8 registers on each side are data
registers. The 3-bit register field in the compact opcode indicates which one of eight
registers is used. When RS is 1, the high register set (A16-A23 and B16-B23) is used;
when RS is 0, the low register set (A0-A7 and B0-B7) is used.
Bits 18-16 (DSZ) determine the two data sizes available to the compact versions of the
LD and ST instructions in a fetch packet. Bit 18 determines the primary data size that
is either word (W) or doubleword (DW). In the case of DW, an opcode bit selects
between aligned (DW) and nonaligned (NDW) accesses. Bits 17 and 16 determine the
secondary data size: byte unsigned (BU), byte (B), halfword unsigned (HU), halfword
(H), word (W), or nonaligned word (NW). Table 3-14 describes how the bits map to
data size.
Bit 14 (SAT) selects saturating arithmetic for compact instructions. When SAT is 1, the
ADD, SUB, SHL, MPY, MPYH, MPYLH, and MPYHL instructions are decoded as SADD, SSUB,
SSHL, SMPY, SMPYH, SMPYLH, and SMPYHL, respectively.
Table 3-14 LD/ST Data Size Selection
DSZ Bits (18 17 16) | Primary Data Size (1) | Secondary Data Size (2)
0 0 0 W BU
0 0 1 W B
0 1 0 W HU
0 1 1 W H
1 0 0 DW/NDW W
1 0 1 DW/NDW B
1 1 0 DW/NDW NW
1 1 1 DW/NDW H
1. Primary data size is word (W) or doubleword (DW). In the case of DW, the access is aligned (DW) or
nonaligned (NDW).
2. Secondary data size is byte unsigned (BU), byte (B), halfword unsigned (HU), halfword (H),
word (W), or nonaligned word (NW).
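The expansion field and Table 3-14 can be combined into one lookup. The sketch below takes the full 32-bit header word and uses the bit positions given in the text; the function name `decode_expansion` and the dictionary keys are hypothetical.

```python
# (primary, secondary) data sizes indexed by DSZ bits 18-16, from Table 3-14.
DATA_SIZES = {
    0b000: ("W", "BU"), 0b001: ("W", "B"),
    0b010: ("W", "HU"), 0b011: ("W", "H"),
    0b100: ("DW/NDW", "W"), 0b101: ("DW/NDW", "B"),
    0b110: ("DW/NDW", "NW"), 0b111: ("DW/NDW", "H"),
}


def decode_expansion(header):
    """Decode PROT, RS, DSZ, BR, and SAT from a compact header word."""
    rs = (header >> 19) & 1
    dsz = (header >> 16) & 0b111
    primary, secondary = DATA_SIZES[dsz]
    return {
        "PROT": (header >> 20) & 1,
        "RS": rs,
        "DSZ": dsz,
        "BR": (header >> 15) & 1,
        "SAT": (header >> 14) & 1,
        "primary": primary,
        "secondary": secondary,
        "registers": "A16-A23, B16-B23" if rs else "A0-A7, B0-B7",
    }


print(decode_expansion((1 << 20) | (0b110 << 16) | (1 << 14)))
```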
Bits 13-0 of the compact instruction header contain the p-bits field. This field specifies
which of the compact instructions within the current fetch packet are executed in
parallel. A 32-bit instruction within a compact fetch packet uses the p-bit internal to
its own 32-bit opcode. Therefore, if the corresponding bit in the layout field is 0
(indicating that the word is a noncompact instruction), the associated bits in the
p-bits field must be zero.
Figure 3-9 shows the p-bits field in the compact header word and Table 3-15 describes
the bits.
Figure 3-9 P-bits Field in Compact Header Word
13 12 11 10 9 8 7 6 5 4 3 2 1 0
P13 P12 P11 P10 P9 P8 P7 P6 P5 P4 P3 P2 P1 P0
Table 3-15 P-bits Field Description in Compact Instruction Packet Header (Part 1 of 2)
Bit Field Value Description
13 P13 0 Word 6 (16 most-significant bits) of fetch packet has parallel bit cleared.
1 Word 6 (16 most-significant bits) of fetch packet has parallel bit set.
12 P12 0 Word 6 (16 least-significant bits) of fetch packet has parallel bit cleared.
1 Word 6 (16 least-significant bits) of fetch packet has parallel bit set.
11 P11 0 Word 5 (16 most-significant bits) of fetch packet has parallel bit cleared.
1 Word 5 (16 most-significant bits) of fetch packet has parallel bit set.
10 P10 0 Word 5 (16 least-significant bits) of fetch packet has parallel bit cleared.
1 Word 5 (16 least-significant bits) of fetch packet has parallel bit set.
Table 3-15 P-bits Field Description in Compact Instruction Packet Header (Part 2 of 2)
Bit Field Value Description
9 P9 0 Word 4 (16 most-significant bits) of fetch packet has parallel bit cleared.
1 Word 4 (16 most-significant bits) of fetch packet has parallel bit set.
8 P8 0 Word 4 (16 least-significant bits) of fetch packet has parallel bit cleared.
1 Word 4 (16 least-significant bits) of fetch packet has parallel bit set.
7 P7 0 Word 3 (16 most-significant bits) of fetch packet has parallel bit cleared.
1 Word 3 (16 most-significant bits) of fetch packet has parallel bit set.
6 P6 0 Word 3 (16 least-significant bits) of fetch packet has parallel bit cleared.
1 Word 3 (16 least-significant bits) of fetch packet has parallel bit set.
5 P5 0 Word 2 (16 most-significant bits) of fetch packet has parallel bit cleared.
1 Word 2 (16 most-significant bits) of fetch packet has parallel bit set.
4 P4 0 Word 2 (16 least-significant bits) of fetch packet has parallel bit cleared.
1 Word 2 (16 least-significant bits) of fetch packet has parallel bit set.
3 P3 0 Word 1 (16 most-significant bits) of fetch packet has parallel bit cleared.
1 Word 1 (16 most-significant bits) of fetch packet has parallel bit set.
2 P2 0 Word 1 (16 least-significant bits) of fetch packet has parallel bit cleared.
1 Word 1 (16 least-significant bits) of fetch packet has parallel bit set.
1 P1 0 Word 0 (16 most-significant bits) of fetch packet has parallel bit cleared.
1 Word 0 (16 most-significant bits) of fetch packet has parallel bit set.
0 P0 0 Word 0 (16 least-significant bits) of fetch packet has parallel bit cleared.
1 Word 0 (16 least-significant bits) of fetch packet has parallel bit set.
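Table 3-15 follows a regular pattern: bit Pn belongs to word n/2 (rounded down), and odd n selects the 16 most-significant bits. The sketch below expresses that pattern and the rule that p-bits of noncompact words must be zero. It assumes layout bit L1 (the least-significant bit of the Layout field) describes word 0 and L7 describes word 6; both function names are hypothetical.

```python
def p_bit_position(n):
    """Return (word, half) of the fetch packet covered by p-bit Pn."""
    return n // 2, "most-significant" if n % 2 else "least-significant"


def p_bits_consistent(layout, p_bits):
    """Check that words marked as 32-bit opcodes have both header p-bits zero.

    layout: 7-bit Layout field, assumed LSB = word 0 (L1) ... MSB = word 6 (L7).
    p_bits: 14-bit p-bits field.
    """
    for word in range(7):
        is_compact = (layout >> word) & 1
        pair = (p_bits >> (2 * word)) & 0b11
        if not is_compact and pair:
            return False
    return True


print(p_bit_position(13))                         # (6, 'most-significant')
print(p_bits_consistent(0b0000001, 0b11))         # True
print(p_bits_consistent(0b0000001, 0b1100))       # False
```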
If the execute packet contains eight instructions, then neither of the two fetch packets
may be header-based.
3.11 Instruction Compatibility
Chapter 4
Instruction Descriptions
This section gives detailed information on the instruction set. Each instruction may
present the following information:
• Assembler syntax
• Functional units
• Operands
• Opcode
• Description
• Execution
• Pipeline
• Instruction type
• Delay slots
• Functional Unit Latency
• Examples
The ADD instruction is used as an example to familiarize you with the way each
instruction is described. The example describes the kind of information you will find in
each part of the individual instruction description and where to obtain more
information.
4.1 Example
The way each instruction is described.
src and dst indicate source and destination, respectively. The (.unit) dictates which
functional unit the instruction is mapped to (.L1, .L2, .S1, .S2, .M1, .M2, .D1, or .D2).
A table is provided for each instruction that gives the opcode map fields, units the
instruction is mapped to, types of operands, and the opcode.
The opcode shows the various fields that make up each instruction. These fields are
described in Table 4-2.
There are instructions that can be executed on more than one functional unit. Table 4-1
on page 4-3 shows how this is documented for the ADD instruction. This instruction
has three opcode map fields: src1, src2, and dst. In the fifth group, the operands have
the types cst5, long, and long for src1, src2, and dst, respectively. The ordering of these
fields implies cst5 + long →long, where + represents the operation being performed by
the ADD. This operation can be done on .L1 or .L2 (both are specified in the unit
column). The s in front of each operand signifies that src1 (scst5), src2 (slong), and dst
(slong) are all signed values.
In the ninth group, src2, src1, and dst are int, cst5, and int, respectively. The u in front
of the cst5 operand signifies that src1 (ucst5) is an unsigned value. Any operand that
begins with x can be read from a register file that is different from the destination
register file. The operand comes from the register file opposite the destination if the x
bit in the instruction is set (shown in the opcode map).
Description Instruction execution and its effect on the rest of the processor or memory contents are
described. Any constraints on the operands imposed by the processor or the assembler
are discussed. The description parallels and supplements the information given by the
execution block.
Execution for .L1, .L2 and .S1, .S2 Opcodes
if (cond) src1 + src2 → dst
else nop
Execution for .D1, .D2 Opcodes
if (cond) src2 + src1 → dst
else nop
The execution describes the processing that takes place when the instruction is
executed. The symbols are defined in Table 4-1.
Pipeline This section contains a table that shows the sources read from, the destinations written
to, and the functional unit used during each execution cycle of the instruction.
Instruction Type This section gives the type of instruction. See Section 5.2 for information about the
pipeline execution of this type of instruction.
Delay Slots This section gives the number of delay slots the instruction takes to execute. See Section
3.4 on page 3-9 for an explanation of delay slots.
Functional Unit Latency This section gives the number of cycles that the functional unit is in use during the
execution of the instruction.
Example Examples of instruction execution. If applicable, register and memory values are given
before and after instruction execution.
Table 4-1 Relationships Between Operands, Operand Size, Functional Units, and Opfields
for Example Instruction (ADD)
Opcode map field used... For operand type... Unit Opfield
src1 sint .L1, .L2 0000011
src2 xsint
dst sint
src1 sint .L1, .L2 0100011
src2 xsint
dst slong
src1 xsint .L1, .L2 0100001
src2 slong
dst slong
src1 scst5 .L1, .L2 0000010
src2 xsint
dst sint
src1 scst5 .L1, .L2 0100000
src2 slong
dst slong
src1 sint .S1, .S2 000111
src2 xsint
dst sint
src1 scst5 .S1, .S2 000110
src2 xsint
dst sint
src2 sint .D1, .D2 010000
src1 sint
dst sint
src2 sint .D1, .D2 010010
src1 ucst5
dst sint
4.2 ABS
Absolute Value With Saturation
or
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 5 4 3 2 1 0
0 0 0 x op 1 1 0 s p
1 7 1 1
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
Example 3
4.3 ABS2
Absolute Value With Saturation, Signed, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 0 x 0 0 1 1 0 1 0 1 1 0 s p
1 1 1
Description The absolute values of the upper and lower halves of the src2 operand are placed in the
upper and lower halves of the dst.
31 16 15 0
a_hi a_lo ← src2
ABS2
↓ ↓
31 16 15 0
abs(a_hi) abs(a_lo) ←dst
Specifically, this instruction performs the following steps for each halfword of src2,
then writes its result to the appropriate halfword of dst:
1. If the value is between 0 and 2^15, then value → dst
2. If the value is less than 0 and not equal to -2^15, then -value → dst
3. If the value is equal to -2^15, then 2^15 - 1 → dst
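The three steps above can be sketched for a 32-bit operand as follows. The function name `abs2` mirrors the mnemonic, but the code is only an illustration of the described behavior, with the operand passed as an unsigned 32-bit integer.

```python
def abs2(src2):
    """Saturating absolute value of each signed 16-bit half of src2."""
    def abs16(half):
        value = half - 0x10000 if half & 0x8000 else half  # signed 16-bit view
        if value == -(1 << 15):
            return (1 << 15) - 1       # step 3: -2^15 saturates to 2^15 - 1
        return -value if value < 0 else value               # steps 2 and 1

    hi = abs16((src2 >> 16) & 0xFFFF)
    lo = abs16(src2 & 0xFFFF)
    return (hi << 16) | lo


print(f"{abs2(0x8000FFFF):08X}h")  # 7FFF0001h
```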
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
See Also ABS
Examples Example 1
Example 2
ABS2 .L1 A0,A2
4.4 ABSDP
Absolute Value, Double-Precision Floating-Point
Opcode
31 29 28 27 23 22 18 17
3 1 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
Reserved x 1 0 1 1 0 0 1 0 0 0 s p
1 1 1
Description The absolute value of src2 is placed in dst. The 64-bit double-precision operand is read
in one cycle by using the src2 port for the 32 MSBs and the src1 port for the 32 LSBs.
Note—
1) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
2) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3) If src2 is denormalized, +0 is placed in dst and the INEX and DEN2 bits are set.
4) If src2 is +infinity or −infinity, +infinity is placed in dst and the INFO bit is set.
Execution if (cond) abs(src2) → dst
else nop
Pipeline
Pipeline Stage E1 E2
Read src2_l, src2_h
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 C004 0000h 0000 0000h -2.5 A1:A0 C004 0000h 0000 0000h
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 4004 0000h 0000 0000h 2.5
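The register values in this example can be checked on a host by clearing the sign bit of the 64-bit pattern. The sketch covers only ordinary (normalized) inputs; the NaN, denormalized, and infinity cases listed in the note, and the status bits they set, are not modeled. The helper names are hypothetical.

```python
import struct


def absdp(pair_bits):
    """Clear the sign bit of a 64-bit double held as odd:even register pair bits."""
    return pair_bits & 0x7FFF_FFFF_FFFF_FFFF


def to_float(bits):
    """Interpret a 64-bit pattern as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


src = 0xC004_0000_0000_0000          # A1:A0 = -2.5
dst = absdp(src)
print(to_float(src), f"{dst >> 32:08X}h {dst & 0xFFFFFFFF:08X}h", to_float(dst))
# -2.5 40040000h 00000000h 2.5
```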
4.5 ABSSP
Absolute Value, Single-Precision Floating-Point
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 1 1 1 1 0 0 1 0 0 0 s p
1 1 1
Note—
1) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
2) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3) If src2 is denormalized, +0 is placed in dst and the INEX and DEN2 bits are set.
4) If src2 is +infinity or −infinity, +infinity is placed in dst and the INFO bit is set.
Execution if (cond) abs(src2) → dst
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .S
Delay Slots 0
4.6 ADD
Add Two Signed Integers Without Saturation
or
or
or
ADD (.D1 or .D2) src2, src1, dst (if the cross path form is not used)
or
ADD (.D1 or .D2) src1, src2, dst (if the cross path form is used)
or
ADD (.D1 or .D2) src2, src1, dst (if the cross path form is used with a constant)
Opcode .L unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
3 5 5 4
14 13 12 11 2 1 0
src1 op x 1011011000 s p
10 1 1
Description for .L1, .L2 and .S1, .S2 Opcodes
src2 is added to src1. The result is placed in dst.
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 0 1 0 1 1 0 0 s p
5 1 1 1
Opcode .D unit (if the cross path form is used with a constant)
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 0 1 1 1 1 0 0 s p
5 1 1 1
Description for .D1, .D2 Opcodes
src1 is added to src2. The result is placed in dst.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
See Also ADDU, ADD2, SADD
Examples Example 1
Example 2
ADD .L1 A1,A3:A2,A5:A4
Example 3
Example 4
Example 5
B0 00000007h B0 00000007h
4.7 ADDAB
Add Using Byte Addressing Mode
or
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
Description src1 is added to src2 using the byte addressing mode specified for src2. The addition
defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode can be
changed to circular mode by writing the appropriate value to the AMR (see
‘‘Addressing Mode Register (AMR)’’ on page 2-12). The result is placed in dst.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Opcode
31 30 29 28 27 23 22
0 0 0 1 dst ucst15
5 15
8 7 6 5 4 3 2 1 0
ucst15 y 0 1 1 1 1 s p
15 1 1 1
Description This instruction reads a register (baseR), B14 (y = 0) or B15 (y = 1), and adds a 15-bit
unsigned constant (ucst15) to it, writing the result to a register (dst). This instruction is
executed unconditionally; it cannot be predicated.
The offset, ucst15, is added to baseR. The result of the calculation is written into dst. The
addressing arithmetic is always performed in linear mode.
The s bit determines the unit used (D1 or D2) and the file the destination is written to:
s = 0 indicates the unit is D1 and dst is in the A register file; and s = 1 indicates the unit
is D2 and dst is in the B register file.
Pipeline
Pipeline Stage E1
Read B14/B15
Written dst
Unit in use .D
Delay Slots 0
Examples Example 1
ADDAB .D1 A4,A2,A4
A2 0000000Bh A2 0000000Bh
A4 00000100h A4 00000103h
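The before and after values of Example 1 are consistent with A4 in circular mode and an 8-byte block (block-size field of 2). That block size is an assumption made for this sketch, since the AMR setting is not listed with the register values. The function `adda` is hypothetical; its shift parameter covers the scaled variants (ADDAH shifts src1 left by 1, ADDAW by 2, ADDAD by 3).

```python
def adda(src2, src1, shift=0, bk=None):
    """Address add: src2 + (src1 << shift), linear or circular.

    shift: 0 for ADDAB, 1 for ADDAH, 2 for ADDAW, 3 for ADDAD.
    bk: None for linear mode, otherwise the block-size field
        (block of 2**(bk + 1) bytes).
    """
    total = (src2 + (src1 << shift)) & 0xFFFFFFFF
    if bk is None:
        return total
    mask = (1 << (bk + 1)) - 1
    # Keep the upper address bits of src2; wrap only the low bits.
    return (src2 & ~mask & 0xFFFFFFFF) | (total & mask)


# ADDAB .D1 A4,A2,A4 with A4 = 100h, A2 = 0Bh, assumed 8-byte circular block.
print(f"{adda(0x100, 0xB, shift=0, bk=2):08X}h")  # 00000103h
```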
Example 2
Example 3
4.8 ADDAD
Add Using Doubleword Addressing Mode
unit = .D1 or .D2
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
Description src1 is added to src2 using the doubleword addressing mode specified for src2. The
addition defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode
can be changed to circular mode by writing the appropriate value to the AMR (see
‘‘Addressing Mode Register (AMR)’’ on page 2-12). src1 is left shifted by 3 due to
doubleword data sizes. The result is placed in dst.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
4.9 ADDAH
Add Using Halfword Addressing Mode
or
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
Description src1 is added to src2 using the halfword addressing mode specified for src2. The
addition defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode
can be changed to circular mode by writing the appropriate value to the AMR (see
‘‘Addressing Mode Register (AMR)’’ on page 2-12). src1 is left shifted by 1. The result
is placed in dst.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Opcode
31 30 29 28 27 23 22
0 0 0 1 dst ucst15
5 15
8 7 6 5 4 3 2 1 0
ucst15 y 1 0 1 1 1 s p
15 1 1 1
Description This instruction reads a register (baseR), B14 (y = 0) or B15 (y = 1), and adds a scaled
15-bit unsigned constant (ucst15) to it, writing the result to a register (dst). This
instruction is executed unconditionally; it cannot be predicated.
The offset, ucst15, is scaled by a left-shift of 1 and added to baseR. The result of the
calculation is written into dst. The addressing arithmetic is always performed in linear
mode.
The s bit determines the unit used (D1 or D2) and the file the destination is written to:
s = 0 indicates the unit is D1 and dst is in the A register file; and s = 1 indicates the unit
is D2 and dst is in the B register file.
Pipeline
Pipeline Stage E1
Read B14/B15
Written dst
Unit in use .D
Delay Slots 0
Examples Example 1
ADDAH .D1 A4,A2,A4
A2 0000000Bh A2 0000000Bh
A4 00000100h A4 00000106h
Example 2
Example 3
4.10 ADDAW
Add Using Word Addressing Mode
or
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
Description src1 is added to src2 using the word addressing mode specified for src2. The addition
defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode can be
changed to circular mode by writing the appropriate value to the AMR (see
‘‘Addressing Mode Register (AMR)’’ on page 2-12). src1 is left shifted by 2. The result
is placed in dst.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Opcode
31 30 29 28 27 23 22
0 0 0 1 dst ucst15
5 15
8 7 6 5 4 3 2 1 0
ucst15 y 1 1 1 1 1 s p
15 1 1 1
Description This instruction reads a register (baseR), B14 (y = 0) or B15 (y = 1), and adds a scaled
15-bit unsigned constant (ucst15) to it, writing the result to a register (dst). This
instruction is executed unconditionally; it cannot be predicated.
The offset, ucst15, is scaled by a left-shift of 2 and added to baseR. The result of the
calculation is written into dst. The addressing arithmetic is always performed in linear
mode.
The s bit determines the unit used (D1 or D2) and the file the destination is written to:
s = 0 indicates the unit is D1 and dst is in the A register file; and s = 1 indicates the unit
is D2 and dst is in the B register file.
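The ucst15 forms of ADDAB, ADDAH, and ADDAW share one opcode layout and differ only in bits 6-2 and in the scaling of the constant. The sketch below decodes such a word using the bit positions shown in the opcode figures (0 0 0 1 in bits 31-28, dst in bits 27-23, ucst15 in bits 22-8, y in bit 7, s in bit 1) and evaluates the result. The function names and the dictionary keys are hypothetical.

```python
# Bits 6-2 of the opcode, as shown in the opcode figures: (mnemonic, left shift).
FORMS = {0b01111: ("ADDAB", 0), 0b10111: ("ADDAH", 1), 0b11111: ("ADDAW", 2)}


def decode_adda_ucst15(word):
    """Decode the B14/B15 + ucst15 form of ADDAB/ADDAH/ADDAW."""
    if (word >> 28) & 0xF != 0b0001:
        raise ValueError("bits 31-28 are not 0001")
    name, shift = FORMS[(word >> 2) & 0x1F]
    return {
        "name": name,
        "shift": shift,
        "dst": (word >> 23) & 0x1F,
        "ucst15": (word >> 8) & 0x7FFF,
        "base": "B15" if (word >> 7) & 1 else "B14",  # y bit
        "unit": ".D2" if (word >> 1) & 1 else ".D1",  # s bit
    }


def execute_adda_ucst15(word, b14, b15):
    """Return the value written to dst; the arithmetic is always linear."""
    f = decode_adda_ucst15(word)
    base = b15 if f["base"] == "B15" else b14
    return (base + (f["ucst15"] << f["shift"])) & 0xFFFFFFFF


# Hypothetical encoding of an ADDAW with B14 as base, ucst15 = 42h, dst register 4.
word = (0b0001 << 28) | (4 << 23) | (0x42 << 8) | (0b11111 << 2)
print(decode_adda_ucst15(word)["name"], f"{execute_adda_ucst15(word, 0x20000, 0):08X}h")
# ADDAW 00020108h
```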
Delay Slots 0
Examples Example 1
ADDAW .D1 A4,2,A4
A4 00020000h A4 00020000h
Example 2
ADDAW .D1X B14,42h,A4
Example 3
4.11 ADDDP
Add Two Double-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Note—
1) This instruction takes the rounding mode from and sets the warning bits in the
floating-point adder configuration register (FADCR), not the floating-point
auxiliary configuration register (FAUCR) as for other .S unit instructions.
2) If rounding is performed, the INEX bit is set.
3) If one source is SNaN or QNaN, the result is NaN_out. If either source is SNaN,
the INVAL bit is also set.
4) If one source is +infinity and the other is −infinity, the result is NaN_out and
the INVAL bit is set.
5) If one source is signed infinity and the other source is anything except NaN or
signed infinity of the opposite sign, the result is signed infinity and the INFO
bit is set.
6) If overflow occurs, the INEX and OVER bits are set and the results are rounded
as follows (LFPN is the largest floating-point number):
7) If underflow occurs, the INEX and UNDER bits are set and the results are
rounded as follows (SFPN is the smallest floating-point number):
8) If the sources are equal numbers of opposite sign, the result is +0 unless the
rounding mode is −infinity, in which case the result is −0.
9) If the sources are both 0 with the same sign or both are denormalized with the
same sign, the sign of the result is negative for negative sources and positive for
positive sources.
10) A signed denormalized source is treated as a signed 0 and the DENn bit is set.
If the other source is not NaN or signed infinity, the INEX bit is set.
Execution if (cond)src1 + src2 → dst
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5 E6 E7
Read src1_l, src1_h,
src2_l src2_h
Written dst_l dst_h
Unit in use .L or .S .L or .S
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP,
MPYSPDP, MPYSP2DP, or SUBDP instruction, the number of delay slots can be
reduced by one, because these instructions read the lower word of the DP source one
cycle before the upper word of the DP source.
Delay Slots 6
B1:B0 4021 3333h 3333 3333h B1:B0 4021 3333h 3333 3333h 8.6
A3:A2 C004 0000h 0000 0000h A3:A2 C004 0000h 0000 0000h -2.5
A5:A4 xxxx xxxxh xxxx xxxxh A5:A4 4018 6666h 6666 6666h 6.1
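The bit patterns in this example can be cross-checked with host double-precision arithmetic. The sketch uses the host's round-to-nearest addition and does not model the FADCR rounding modes, the status bits, or the denormalized-source handling described in the note; the helper names are hypothetical.

```python
import struct


def bits_to_double(bits):
    """Interpret a 64-bit pattern as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def double_to_bits(value):
    """Return the 64-bit pattern of an IEEE 754 double."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


src1 = 0x4021_3333_3333_3333   # B1:B0 = 8.6
src2 = 0xC004_0000_0000_0000   # A3:A2 = -2.5
dst = double_to_bits(bits_to_double(src1) + bits_to_double(src2))
print(f"A5:A4 = {dst >> 32:08X}h {dst & 0xFFFFFFFF:08X}h")
# A5:A4 = 40186666h 66666666h
```

The upper word of the result belongs in the odd register of the pair (A5) and the lower word in the even register (A4).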
4.12 ADDK
Add Signed 16-Bit Constant to Register
Opcode
31 29 28 27 23 22
3 1 5 16
7 6 5 4 3 2 1 0
cst16 1 0 1 0 0 s p
16 1 1
Description A 16-bit signed constant, cst16, is added to the dst register specified. The result is placed
in dst.
Pipeline
Pipeline Stage E1
Read cst16
Written dst
Unit in use .S
Delay Slots 0
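The ADDK operation amounts to sign-extending the 16-bit constant and adding it to the register. A minimal sketch, assuming the result wraps to 32 bits; `addk` is a hypothetical function name and the register is passed as an unsigned 32-bit integer.

```python
def addk(dst, cst16):
    """Add a sign-extended 16-bit constant to a 32-bit register value."""
    cst = cst16 & 0xFFFF
    if cst & 0x8000:
        cst -= 0x10000          # sign-extend the 16-bit field
    return (dst + cst) & 0xFFFFFFFF


print(f"{addk(0x002137E1, 15401):08X}h")   # 0021740Ah
print(f"{addk(0x00000010, 0xFFFF):08X}h")  # 0000000Fh (constant is -1)
```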
4.13 ADDKPC
Add Signed 7-Bit Constant to Program Counter
unit = .S2
Opcode
31 29 28 27 23 22 16
creg z dst src1
3 1 5 7
15 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src2 0 0 0 0 1 0 1 1 0 0 0 s p
3 1 1
Description A 7-bit signed constant, src1, is shifted 2 bits to the left, then added to the address of the
first instruction of the fetch packet that contains the ADDKPC instruction (PCE1).
The result is placed in dst. The 3-bit unsigned constant, src2, specifies the number of
NOP cycles to insert after the current instruction. This instruction helps reduce the
number of instructions needed to set up the return address for a function call.
The following code:
        B       .S2     func
        MVKL    .S2     LABEL, B3
        MVKH    .S2     LABEL, B3
        NOP     3
LABEL
could be replaced by:
        B       .S2     func
        ADDKPC  .S2     LABEL, B3, 4
LABEL
The 7-bit value coded as src1 is the difference between LABEL and PCE1 shifted right
by 2 bits. The address of LABEL must be within 9 bits of PCE1.
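The address arithmetic described above can be checked with a small C model. This is only a sketch: the function names are hypothetical, the addresses in the demo are made up, and LABEL is assumed to be word aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not a TI API) modelling the ADDKPC address math. */

/* Value the assembler would encode in src1: (LABEL - PCE1) / 4. */
int32_t addkpc_encode_src1(uint32_t pce1, uint32_t label)
{
    int32_t diff = (int32_t)(label - pce1);  /* signed byte offset */
    assert(diff % 4 == 0);                   /* word-aligned target assumed */
    int32_t scst7 = diff / 4;
    assert(scst7 >= -64 && scst7 <= 63);     /* must fit in 7 signed bits */
    return scst7;
}

/* Value the instruction writes to dst: PCE1 + (src1 << 2). */
uint32_t addkpc_dst(uint32_t pce1, int32_t scst7)
{
    return pce1 + ((uint32_t)scst7 << 2);
}

/* Usage sketch with illustrative addresses. */
void addkpc_demo(void)
{
    uint32_t pce1 = 0x00001000u, label = 0x0000100Cu;
    int32_t src1 = addkpc_encode_src1(pce1, label);
    assert(src1 == 3);
    assert(addkpc_dst(pce1, src1) == label);
}
```

The range check on the encoded value reflects the 7-bit signed field: after the 2-bit shift, the reachable byte offset spans 9 bits.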
Only one ADDKPC instruction can be executed per cycle. An ADDKPC instruction
cannot be paired with any relative branch instruction in the same execute packet. If an
ADDKPC and a relative branch are in the same execute packet, and if the ADDKPC
instruction is executed when the branch is taken, behavior is undefined.
The ADDKPC instruction cannot be paired with any other multicycle NOP
instruction in the same execute packet. Instructions that generate a multicycle NOP
are: IDLE, BNOP, and the multicycle NOP.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
4.14 ADDSP
Add Two Single-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Note—
1) This instruction takes the rounding mode from and sets the warning bits in the
floating-point adder configuration register (FADCR), not in the floating-point
auxiliary configuration register (FAUCR) as for other .S unit instructions.
2) If rounding is performed, the INEX bit is set.
3) If one source is SNaN or QNaN, the result is NaN_out. If either source is SNaN,
the INVAL bit is also set.
4) If one source is +infinity and the other is −infinity, the result is NaN_out and
the INVAL bit is set.
5) If one source is signed infinity and the other source is anything except NaN or
signed infinity of the opposite sign, the result is signed infinity and the INFO
bit is set.
6) If overflow occurs, the INEX and OVER bits are set and the results are rounded
as follows (LFPN is the largest floating-point number):
7) If underflow occurs, the INEX and UNDER bits are set and the results are
rounded as follows (SPFN is the smallest floating-point number):
8) If the sources are equal numbers of opposite sign, the result is +0 unless the
rounding mode is −infinity, in which case the result is −0.
9) If the sources are both 0 with the same sign or both are denormalized with the
same sign, the sign of the result is negative for negative sources and positive for
positive sources.
10) A signed denormalized source is treated as a signed 0 and the DENn bit is set.
If the other source is not NaN or signed infinity, the INEX bit is set.
Execution if (cond)src1 + src2 → dst
else nop
Pipeline
Pipeline Stage   E1           E2   E3   E4
Read             src1, src2
Written                                 dst
Unit in use      .L or .S
Delay Slots 3
4.15 ADDSUB
Parallel ADD and SUB Operations On Common Inputs
Opcode
31 30 29 28 27 24 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 0 0 1 1 0 s p
5 1 1 1
Delay Slots 0
Examples Example 1
A0 0700C005h A2 0700C006h
A1 FFFFFFFFh A3 0700C004h
Example 2
B0 7FFFFFFFh B2 7FFFFFFEh
A1 00000001h B3 80000000h
4.16 ADDSUB2
Parallel ADD2 and SUB2 Operations On Common Inputs
Opcode
31 30 29 28 27 24 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 0 1 1 1 0 s p
5 1 1 1
Description For the ADD2 operation, the upper and lower halves of the src2 operand are added to
the upper and lower halves of the src1 operand. The values in src1 and src2 are treated
as signed, packed 16-bit data and the results are written in signed, packed 16-bit format
into dst_o.
For the SUB2 operation, the upper and lower halves of the src2 operand are subtracted
from the upper and lower halves of the src1 operand. The values in src1 and src2 are
treated as signed, packed 16-bit data and the results are written in signed, packed 16-bit
format into dst_e.
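The two packed operations described above can be modelled in C as follows. This is a sketch under stated assumptions: the type addsub2_result and the function names are hypothetical, and each 16-bit lane is assumed to wrap around rather than saturate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: dst_o holds the ADD2 result, dst_e the SUB2 result. */
typedef struct {
    uint32_t dst_o;
    uint32_t dst_e;
} addsub2_result;

addsub2_result addsub2_model(uint32_t src1, uint32_t src2)
{
    uint32_t a_hi = src1 >> 16, a_lo = src1 & 0xFFFFu;
    uint32_t b_hi = src2 >> 16, b_lo = src2 & 0xFFFFu;
    addsub2_result r;

    /* Each lane is masked to 16 bits so no carry or borrow crosses lanes.
       Wrap-around (no saturation) is an assumption of this sketch. */
    r.dst_o = (((a_hi + b_hi) & 0xFFFFu) << 16) | ((a_lo + b_lo) & 0xFFFFu);
    r.dst_e = (((a_hi - b_hi) & 0xFFFFu) << 16) | ((a_lo - b_lo) & 0xFFFFu);
    return r;
}

/* Usage sketch: halves (3, 5) combined with halves (1, 7). */
void addsub2_demo(void)
{
    addsub2_result r = addsub2_model(0x00030005u, 0x00010007u);
    assert(r.dst_o == 0x0004000Cu);  /* 3 + 1 and 5 + 7 */
    assert(r.dst_e == 0x0002FFFEu);  /* 3 - 1 and 5 - 7 = -2 */
}
```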
Delay Slots 0
Examples Example 1
Example 2
Example 3
Example 4
4.17 ADDU
Add Two Unsigned Integers Without Saturation
Opcode .L unit
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
3 5 5 4
14 13 12 11 2 1 0
src1 op x 1011101000 s p
10
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
ADDU .L1 A1,A2,A5:A4
Example 2
4.18 ADD2
Add Two 16-Bit Integers on Upper and Lower Register Halves
Opcode .S unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 0 1 1 0 0 0 s p
5 1 1 1
Opcode .L Unit
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 1 0 1 1 1 0 s p
5 1 1 1
Opcode .D unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 1 0 0 1 1 0 0 s p
5 1 1 1
Description The upper and lower halves of the src1 operand are added to the upper and lower halves
of the src2 operand. The values in src1 and src2 are treated as signed, packed 16-bit data
and the results are written in signed, packed 16-bit format into dst.
For each pair of signed packed 16-bit values found in the src1 and src2, the sum between
the 16-bit value from src1 and the 16-bit value from src2 is calculated to produce a
16-bit result. The result is placed in the corresponding positions in the dst. The carry
from the lower half add does not affect the upper half add.
31 16 15 0
a_hi a_lo ←src1
+ +
ADD2
b_hi b_lo ←src2
= =
31 16 15 0
a_hi + b_hi a_lo + b_lo ←dst
Execution if (cond){
msb16(src1) + msb16(src2) → msb16(dst);
lsb16(src1) + lsb16(src2) → lsb16(dst)
}
else nop
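The execution rule above, including the statement that the carry from the lower half does not affect the upper half, can be sketched in C. The names add2_model and add2_demo are hypothetical and the operand values are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of ADD2 (not a TI API): two independent 16-bit adds. */
uint32_t add2_model(uint32_t src1, uint32_t src2)
{
    /* Masking each sum to 16 bits discards the lane's carry-out. */
    uint32_t hi = ((src1 >> 16) + (src2 >> 16)) & 0xFFFFu;
    uint32_t lo = ((src1 & 0xFFFFu) + (src2 & 0xFFFFu)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Usage sketch: the low lane overflows, yet the high lane is unaffected. */
void add2_demo(void)
{
    assert(add2_model(0x0001FFFFu, 0x00010001u) == 0x00020000u);
    /* A plain 32-bit add would let the carry leak into the upper half. */
    assert(0x0001FFFFu + 0x00010001u == 0x00030000u);
}
```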
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S, .L, .D
Delay Slots 0
Examples Example 1
Example 2
4.19 ADD4
Add Without Saturation, Four 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 1 0 1 1 1 0 s p
5 1 1 1
Description Performs 2s-complement addition between packed 8-bit quantities. The values in src1
and src2 are treated as packed 8-bit data and the results are written into dst in a packed
8-bit format.
For each pair of packed 8-bit values in src1 and src2, the sum between the 8-bit value
from src1 and the 8-bit value from src2 is calculated to produce an 8-bit result. No
saturation is performed. The carry from one 8-bit add does not affect the add of any
other 8-bit add. The result is placed in the corresponding positions in dst:
• The sum of src1 byte0 and src2 byte0 is placed in byte0 of dst.
• The sum of src1 byte1 and src2 byte1 is placed in byte1 of dst.
• The sum of src1 byte2 and src2 byte2 is placed in byte2 of dst.
• The sum of src1 byte3 and src2 byte3 is placed in byte3 of dst.
31 24 23 16 15 8 7 0
a_3 a_2 a_1 a_0 ←src1
+ + + +
ADD4
b_3 b_2 b_1 b_0 ←src2
= = = =
31 24 23 16 15 8 7 0
a_3 + b_3 a_2 + b_2 a_1 + b_1 a_0 + b_0 ←dst
Execution if (cond){
byte0(src1) + byte0(src2) → byte0(dst);
byte1(src1) + byte1(src2) → byte1(dst);
byte2(src1) + byte2(src2) → byte2(dst);
byte3(src1) + byte3(src2) → byte3(dst)
}
else nop
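The four byte-wise additions above can be modelled with a short loop in C. This is a sketch only; add4_model and add4_demo are hypothetical names and the operand values are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of ADD4 (not a TI API): four independent 8-bit adds. */
uint32_t add4_model(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        unsigned shift = 8u * lane;
        uint32_t a = (src1 >> shift) & 0xFFu;
        uint32_t b = (src2 >> shift) & 0xFFu;
        /* Keep only 8 bits: no saturation, no carry into the next byte. */
        dst |= ((a + b) & 0xFFu) << shift;
    }
    return dst;
}

/* Usage sketch: bytes 2 and 0 overflow and wrap to 00h. */
void add4_demo(void)
{
    assert(add4_model(0x01FF7F80u, 0x01018080u) == 0x0200FF00u);
}
```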
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
4.20 AND
Bitwise AND
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 7
Opcode fields:
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
3 5 5 5 6
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
5 5 5 6
Opcode fields:
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
3 5 5 5 4
Description The AND instruction performs the bit-wise AND between two source registers and
stores the result in a third register.
else nop
Delay Slots 0
Example A0 == 0xffffffff
AND .L 15,A0,A15
A15 == 0x0000000f
A0 == 0xdeadbeef
A1 == 0xbeefbabe
AND .L A0,A1,A2
A2 == 0x9eadbaae
A1 == 0xdeadbeef
A0 == 0x12340000
A3 == 0xbeefbabe
A2 == 0x00006789
AND .L A1:A0,A3:A2,A9:A8
A9 == 0x9eadbaae
A8 == 0x00000000
4.21 ANDN
Bitwise AND Invert
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
5 5 5 7
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
3 5 5 5 4
31 30 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
5 5 5 4
Description The ANDN instruction performs the bitwise AND between op1 and the bitwise inverse
of xop2, storing the result in dst.
This instruction can be used in the following way, to perform a multiplex function:
;A13 contains the final multiplexed output of A10 and A11 using A12
OR.L1 A13, A14, A13
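The multiplex idea can be sketched in C. This is an illustration under assumptions: the function names are hypothetical, and the choice that set mask bits select the first input is an assumption, since the full instruction listing is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of ANDN (not a TI API): src1 AND (NOT src2). */
uint32_t andn_model(uint32_t src1, uint32_t src2)
{
    return src1 & ~src2;
}

/* Bitwise multiplexer built from AND, ANDN, and OR.
   Assumption: mask bits set to 1 select a, bits set to 0 select b. */
uint32_t mux_model(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | andn_model(b, mask);
}

/* Usage sketch: the first two checks reuse operand values from the examples. */
void andn_demo(void)
{
    assert(andn_model(0xdeadbeefu, 0xbeefbabeu) == 0x40000441u);
    assert(andn_model(0xaaaaaaaau, 0xccccccccu) == 0x22222222u);
    assert(mux_model(0xAAAAAAAAu, 0x55555555u, 0xFFFF0000u) == 0xAAAA5555u);
}
```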
Delay Slots 0
Example A0 == 0xdeadbeef
A1 == 0xbeefbabe
ANDN .L A0,A1,A2
A2 == 0x40000441
A0 == 0xaaaaaaaa
A1 == 0xcccccccc
A3 == 0xdeadbeef
A2 == 0x00000000
A13 == 0xbeefbabe
A12 == 0x00000000
ANDN . A3:A2,A13:A12,A11:A10
A11 == 0x40000441
A10 == 0x00000000
A3 == 0x00000000
A2 == 0xffffffff
A13 == 0xffffffff
A12 == 0x00000000
ANDN . A3:A2,A13:A12,A11:A10
A11 == 0x00000000
A10 == 0xffffffff
A3 == 0xffffffff
A2 == 0xaaaaaaaa
A13 == 0xffffffff
A12 == 0xcccccccc
ANDN . A3:A2,A13:A12,A11:A10
A11 == 0x00000000
A10 == 0x22222222
4.22 AVG2
Average, Signed, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 1 1 1 1 0 0 s p
5 1 1 1
Description Performs an averaging operation on packed 16-bit data. For each pair of signed 16-bit
values found in src1 and src2, AVG2 calculates the average of the two values and returns
a signed 16-bit quantity in the corresponding position in the dst.
The averaging operation is performed by adding 1 to the sum of the two 16-bit numbers
being averaged. The result is then right-shifted by 1 to produce a 16-bit result.
31 16 15 0
sa_1 sa_0 ←src1
AVG2
sb_1 sb_0 ←src2
↓ ↓
31 16 15 0
(sa_1 + sb_1 + 1) >> 1 (sa_0 + sb_0 + 1) >> 1 ←dst
Execution if (cond){
((lsb16(src1) + lsb16(src2) + 1) >> 1) → lsb16(dst);
((msb16(src1) + msb16(src2) + 1) >> 1) → msb16(dst)
}
else nop
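The rounding average in the execution rule above can be modelled in C. This sketch assumes the right shift is arithmetic (it rounds toward minus infinity for negative sums) and is written so that it does not depend on the C compiler's handling of negative shifts. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract a signed 16-bit lane without relying on implementation-defined casts. */
static int32_t lane16_signed(uint32_t word, unsigned shift)
{
    int32_t v = (int32_t)((word >> shift) & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Portable equivalent of an arithmetic right shift by 1 (floor of x / 2). */
static int32_t floor_half(int32_t x)
{
    return (x >= 0) ? x / 2 : -((-x + 1) / 2);
}

/* Hypothetical model of AVG2 (not a TI API). */
uint32_t avg2_model(uint32_t src1, uint32_t src2)
{
    int32_t lo = floor_half(lane16_signed(src1, 0) + lane16_signed(src2, 0) + 1);
    int32_t hi = floor_half(lane16_signed(src1, 16) + lane16_signed(src2, 16) + 1);
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}

/* Usage sketch: high lane averages 3 and 4, low lane averages -3 and -4. */
void avg2_demo(void)
{
    assert(avg2_model(0x0003FFFDu, 0x0004FFFCu) == 0x0004FFFDu);
}
```

The sum is formed in 32 bits before the shift, so the intermediate 17-bit value cannot overflow.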
Pipeline
Pipeline Stage   E1           E2
Read             src1, src2
Written                       dst
Unit in use      .M
Delay Slots 1
4.23 AVGU4
Average, Unsigned, Packed 8-Bit
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 1 0 1 1 0 0 s p
5 1 1 1
Description Performs an averaging operation on packed 8-bit data. The values in src1 and src2 are
treated as unsigned, packed 8-bit data and the results are written in unsigned, packed
8-bit format. For each unsigned, packed 8-bit value found in src1 and src2, AVGU4
calculates the average of the two values and returns an unsigned, 8-bit quantity in the
corresponding positions in the dst.
The averaging operation is performed by adding 1 to the sum of the two 8-bit numbers
being averaged. The result is then right-shifted by 1 to produce an 8-bit result.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
AVGU4
ub_3 ub_2 ub_1 ub_0 ←src2
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
(ua_3 + ub_3 + 1) >> 1 (ua_2 + ub_2 + 1) >> 1 (ua_1 + ub_1 + 1) >> 1 (ua_0 + ub_0 + 1) >> 1 ←dst
Execution if (cond){
((ubyte0(src1) + ubyte0(src2) + 1) >> 1) → ubyte0(dst);
((ubyte1(src1) + ubyte1(src2) + 1) >> 1) → ubyte1(dst);
((ubyte2(src1) + ubyte2(src2) + 1) >> 1) → ubyte2(dst);
((ubyte3(src1) + ubyte3(src2) + 1) >> 1) → ubyte3(dst)
}
else nop
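The unsigned byte average above can be modelled in C with one loop over the four lanes. The names are hypothetical, and the second operand in the demo is an arbitrary value chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of AVGU4 (not a TI API): four unsigned rounding averages. */
uint32_t avgu4_model(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        unsigned shift = 8u * lane;
        uint32_t a = (src1 >> shift) & 0xFFu;
        uint32_t b = (src2 >> shift) & 0xFFu;
        /* The 9-bit sum is formed first, so the result always fits in 8 bits. */
        dst |= ((a + b + 1u) >> 1) << shift;
    }
    return dst;
}

/* Usage sketch: bytes 26, 46, 95, 78 averaged with 158, 242, 110, 63. */
void avgu4_demo(void)
{
    assert(avgu4_model(0x1A2E5F4Eu, 0x9EF26E3Fu) == 0x5C906747u);
}
```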
Pipeline
Pipeline Stage   E1           E2
Read             src1, src2
Written                       dst
Unit in use      .M
Delay Slots 1
Register   Before instruction                       After instruction
A0         1A 2E 5F 4Eh (26 46 95 78 unsigned)      1A 2E 5F 4Eh
4.24 B
Branch Using a Displacement
Opcode
31 29 28 27
creg z cst21
3 1 21
7 6 5 4 3 2 1 0
cst21 0 0 1 0 0 s p
21 1 1
Description A 21-bit signed constant, cst21, is shifted left by 2 bits and is added to the address of the
first instruction of the fetch packet that contains the branch instruction. The result is
placed in the program fetch counter (PFC). The assembler/linker automatically
computes the correct value for cst21 by the following formula:
cst21 = (label - PCE1) >> 2
If two branches are in the same execute packet and both are taken, behavior is
undefined.
Two conditional branches can be in the same execute packet if one branch uses a
displacement and the other uses a register, IRP, or NRP. As long as only one branch has
a true condition, the code executes in a well-defined way.
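The displacement arithmetic for this branch can be sketched in C: the 21-bit field is sign extended, shifted left by 2, and added to PCE1. All function names are hypothetical, and the target is assumed to be word aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of `field` (1 <= bits <= 31). */
int32_t sign_extend_field(uint32_t field, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    uint32_t mask = (1u << bits) - 1u;
    uint32_t sign = 1u << (bits - 1);
    field &= mask;
    return (int32_t)((int64_t)(field ^ sign) - (int64_t)sign);
}

/* Branch target: PCE1 + (se(cst21) << 2), computed modulo 2^32. */
uint32_t branch_target(uint32_t pce1, uint32_t cst21_field)
{
    int32_t disp = sign_extend_field(cst21_field, 21);
    return pce1 + ((uint32_t)disp << 2);
}

/* Inverse mapping: the 21-bit field for a given label, (label - PCE1) / 4. */
uint32_t branch_encode_cst21(uint32_t pce1, uint32_t label)
{
    int32_t diff = (int32_t)(label - pce1);
    assert(diff % 4 == 0);  /* word-aligned target assumed */
    return (uint32_t)(diff / 4) & 0x1FFFFFu;
}

/* Usage sketch: a branch in the fetch packet at 0h that targets 0Ch. */
void branch_demo(void)
{
    assert(branch_encode_cst21(0x00000000u, 0x0000000Cu) == 3u);
    assert(branch_target(0x00000000u, 3u) == 0x0000000Cu);
}
```

A backward branch encodes a negative displacement, so the all-ones field 1FFFFFh moves the target 4 bytes before PCE1.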
Note—
1) PCE1 (program counter) represents the address of the first instruction in the
fetch packet in the E1 stage of the pipeline.
PFC is the program fetch counter.
2) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
3) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
4) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read
Written
Branch taken ✓
Unit in use .S
Delay Slots 5
Example Table 4-2 gives the program counter values and actions for the following code example.
Table 4-2 Program Counter Values for Branch Using a Displacement Example
Cycle Program Counter Value Action
Cycle 0 0000 0000h Branch command executes (target code fetched)
Cycle 1 0000 0004h
Cycle 2 0000 000Ch
Cycle 3 0000 0014h
Cycle 4 0000 0018h
Cycle 5 0000 001Ch
Cycle 6 0000 000Ch Branch target code executes
Cycle 7 0000 0014h
4.25 B
Branch Using a Register
unit = .S2
Opcode
31 29 28 27 26 25 24 23 22 18 17 16
creg z 0 0 0 0 0 src2 0 0
3 1 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 1 1 0 1 1 0 0 0 s p
1 1 1
If two branches are in the same execute packet and are both taken, behavior is
undefined.
Two conditional branches can be in the same execute packet if one branch uses a
displacement and the other uses a register, IRP, or NRP. As long as only one branch has
a true condition, the code executes in a well-defined way.
Note—
1) This instruction executes on .S2 only. PFC is the program fetch counter.
2) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
3) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
4) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
Execution if (cond)src2 → PFC
else nop
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read src2
Written
Branch taken ✓
Unit in use .S2
Delay Slots 5
Example Table 4-3 on page 4-61 gives the program counter values and actions for the following
code example. In this example, the B10 register holds the value 1000 000Ch.
Table 4-3 Program Counter Values for Branch Using a Register Example
Cycle Program Counter Value Action
Cycle 0 1000 0000h Branch command executes (target code fetched)
Cycle 1 1000 0004h
Cycle 2 1000 000Ch
Cycle 3 1000 0014h
Cycle 4 1000 0018h
Cycle 5 1000 001Ch
Cycle 6 1000 000Ch Branch target code executes
Cycle 7 1000 0014h
4.26 B IRP
Branch Using an Interrupt Return Pointer
unit = .S2
Opcode
31 29 28 27 23 22 21 20 19 18 17 16
creg z dst 0 0 1 1 0 0 0
3 1 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 0 1 1 1 0 0 0 s p
1 1 1
Description IRP is placed in the program fetch counter (PFC). This instruction also moves the PGIE
bit value to the GIE bit. The PGIE bit is unchanged.
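The state change described above can be sketched as a small C model. This is illustrative only: the structure cpu_model and the function names are hypothetical, and the bit positions used for GIE and PGIE are assumptions of the sketch that should be checked against the control register description. Pipeline timing (delay slots) is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define CSR_GIE_MASK  0x1u  /* assumed position of the GIE bit */
#define CSR_PGIE_MASK 0x2u  /* assumed position of the PGIE bit */

/* Hypothetical CPU state holding only the registers this sketch needs. */
typedef struct {
    uint32_t pfc;  /* program fetch counter */
    uint32_t irp;  /* interrupt return pointer */
    uint32_t csr;  /* register holding the GIE and PGIE bits */
} cpu_model;

/* Model of B IRP: IRP goes to PFC, PGIE is copied to GIE, PGIE is unchanged. */
void b_irp_model(cpu_model *cpu, int cond)
{
    if (!cond)
        return;  /* predicate false: behaves as a nop */
    cpu->pfc = cpu->irp;
    if (cpu->csr & CSR_PGIE_MASK)
        cpu->csr |= CSR_GIE_MASK;
    else
        cpu->csr &= ~CSR_GIE_MASK;
}

/* Usage sketch: returning from an interrupt re-enables interrupts. */
void b_irp_demo(void)
{
    cpu_model cpu = { 0x00000000u, 0x00008000u, CSR_PGIE_MASK };
    b_irp_model(&cpu, 1);
    assert(cpu.pfc == 0x00008000u);
    assert(cpu.csr == (CSR_PGIE_MASK | CSR_GIE_MASK));
}
```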
If two branches are in the same execute packet and are both taken, behavior is
undefined.
Two conditional branches can be in the same execute packet if one branch uses a
displacement and the other uses a register, IRP, or NRP. As long as only one branch has
a true condition, the code executes in a well-defined way.
Note—
1) This instruction executes on .S2 only. PFC is the program fetch counter.
2) Refer to Chapter 6 ‘‘Interrupts’’ on page 6-1 for more information on IRP,
PGIE, and GIE.
3) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
4) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
5) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
Execution if (cond)IRP → PFC
else nop
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read IRP
Written
Branch taken ✓
Unit in use .S2
Delay Slots 5
Example Table 4-4 gives the program counter values and actions for the following code example.
Given that an interrupt occurred at
4.27 B NRP
Branch Using NMI Return Pointer
unit = .S2
Opcode
31 29 28 27 23 22 21 20 19 18 17 16
creg z dst 0 0 1 1 1 0 0
3 1 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 0 1 1 1 0 0 0 s p
1 1 1
Description NRP is placed in the program fetch counter (PFC). This instruction also sets the NMIE
bit. The PGIE bit is unchanged.
If two branches are in the same execute packet and are both taken, behavior is
undefined.
Two conditional branches can be in the same execute packet if one branch uses a
displacement and the other uses a register, IRP, or NRP. As long as only one branch has
a true condition, the code executes in a well-defined way.
Note—
1) This instruction executes on .S2 only. PFC is the program fetch counter.
2) Refer to Chapter 6 ‘‘Interrupts’’ on page 6-1 for more information on NRP and
NMIE.
3) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
4) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
5) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
Execution if (cond)NRP → PFC
else nop
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read NRP
Written
Branch taken ✓
Unit in use .S2
Delay Slots 5
Example Table 4-5 on page 4-65 gives the program counter values and actions for the following
code example. Given that an interrupt occurred at
4.28 BDEC
Branch and Decrement
Opcode
31 29 28 27 23 22
3 1 5 10
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src 1 0 0 0 0 0 0 1 0 0 0 s p
10 1 1
Description If the predication and decrement register (dst) is nonnegative (greater than or equal to 0),
the BDEC instruction performs a relative branch and decrements dst by 1. The
instruction performs the relative branch using a 10-bit signed constant, scst10, in src.
The constant is shifted 2 bits to the left, then added to the address of the first instruction
of the fetch packet that contains the BDEC instruction (PCE1). The result is placed in
the program fetch counter (PFC).
        CMPLT   .L1     A10,0,A1
[!A1]   SUB     .L1     A10,1,A10
||[!A1] B       .S1     func
        NOP     5
Note—
1) Only one BDEC instruction can be executed per cycle. The BDEC instruction
can be predicated by using any conventional condition register. The conditions
are effectively ANDed together. If two branches are in the same execute packet,
and if both are taken, behavior is undefined.
2) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
3) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
4) The BDEC instruction cannot be in the same execute packet as an ADDKPC
instruction.
Execution if (cond){
if (dst >= 0), PFC = (PCE1 + (se(scst10) << 2));
if (dst >= 0), dst = dst - 1;
else nop
}
else nop
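The execution rule can be sketched in C, with the constant shifted left by 2 before it is added to PCE1, as the description states. The type bdec_result and the function names are hypothetical, and pipeline delay slots are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of one BDEC execution. */
typedef struct {
    uint32_t pfc;    /* program fetch counter after the instruction */
    int32_t  dst;    /* counter register after the instruction */
    int      taken;  /* 1 if the branch was taken */
} bdec_result;

bdec_result bdec_model(uint32_t pce1, uint32_t pfc, int32_t dst,
                       int32_t scst10, int cond)
{
    bdec_result r = { pfc, dst, 0 };
    assert(scst10 >= -512 && scst10 <= 511);  /* 10-bit signed constant */
    if (cond && dst >= 0) {
        r.pfc = pce1 + ((uint32_t)scst10 << 2);
        r.dst = dst - 1;
        r.taken = 1;
    }
    return r;
}

/* Usage sketch: a counter that starts at 2 takes the branch three times. */
void bdec_demo(void)
{
    int32_t dst = 2;
    int taken_count = 0;
    for (;;) {
        bdec_result r = bdec_model(0x1000u, 0x1020u, dst, -4, 1);
        dst = r.dst;
        if (!r.taken)
            break;
        assert(r.pfc == 0x0FF0u);  /* 1000h + (-4 << 2) */
        ++taken_count;
    }
    assert(taken_count == 3);
    assert(dst == -1);
}
```

Because the test is dst >= 0 and the decrement happens only when the branch is taken, a loop that should run N times would load N - 1 into dst under this model.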
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read dst
Written dst, PC
Branch taken ✓
Unit in use .S
Delay Slots 5
Examples Example 1
Example 2
4.29 BITC4
Bit Count, Packed 8-Bit
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 0 x 0 0 0 0 1 1 1 1 0 0 s p
1 1 1
Description Performs a bit-count operation on 8-bit quantities. The value in src2 is treated as
packed 8-bit data, and the result is written in packed 8-bit format. For each of the 8-bit
quantities in src2, the count of the number of 1 bits in that value is written to the
corresponding position in dst.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ←src2
BITC4
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
bit_count(ub_3) bit_count(ub_2) bit_count(ub_1) bit_count(ub_0) ←dst
Execution if (cond){
bit_count(src2(ubyte0)) → ubyte0(dst);
bit_count(src2(ubyte1)) → ubyte1(dst);
bit_count(src2(ubyte2)) → ubyte2(dst);
bit_count(src2(ubyte3)) → ubyte3(dst)
}
else nop
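The per-byte bit count above can be modelled in C with a simple loop. The names bitc4_model and bitc4_demo are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of BITC4 (not a TI API): count the 1 bits in each byte. */
uint32_t bitc4_model(uint32_t src2)
{
    uint32_t dst = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        unsigned shift = 8u * lane;
        uint32_t byte = (src2 >> shift) & 0xFFu;
        uint32_t count = 0;
        while (byte != 0) {
            count += byte & 1u;
            byte >>= 1;
        }
        dst |= count << shift;  /* count is 0..8, so it fits in the lane */
    }
    return dst;
}

/* Usage sketch: 9Eh has 5 ones, 52h has 3, 6Eh has 5, 30h has 2. */
void bitc4_demo(void)
{
    assert(bitc4_model(0x9E526E30u) == 0x05030502u);
}
```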
Pipeline
Pipeline Stage   E1           E2
Read             src2
Written                       dst
Unit in use      .M
Delay Slots 1
A1 9E 52 6E 30h A1 9E 52 6E 30h
4.30 BITR
Bit Reverse
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 x 0 0 0 0 1 1 1 1 0 0 s p
1 1 1
Description Implements a bit-reversal function that reverses the order of bits in a 32-bit word. This
means that bit 0 of the source becomes bit 31 of the result, bit 1 of the source becomes
bit 30 of the result, bit 2 becomes bit 29, and so on.
31 0
abcd efgh ijkl mnop qrst uvwx yzAB CDEF ←src2
BITR
31 0
FEDC BAzy xwvu tsrq ponm lkji hgfe dcba ←dst
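The bit reversal shown above can be modelled in C by shifting bits out of the source and into the result. The names bitr_model and bitr_demo are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of BITR (not a TI API): reverse the 32 bits of a word. */
uint32_t bitr_model(uint32_t src2)
{
    uint32_t dst = 0;
    for (unsigned i = 0; i < 32; ++i) {
        /* Bit 0 of the source is pushed in first, so it ends up at bit 31. */
        dst = (dst << 1) | (src2 & 1u);
        src2 >>= 1;
    }
    return dst;
}

/* Usage sketch: bit 0 becomes bit 31, and reversing twice is the identity. */
void bitr_demo(void)
{
    assert(bitr_model(0x00000001u) == 0x80000000u);
    assert(bitr_model(0x12345678u) == 0x1E6A2C48u);
    assert(bitr_model(bitr_model(0xDEADBEEFu)) == 0xDEADBEEFu);
}
```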
Pipeline
Pipeline Stage   E1           E2
Read             src2
Written                       dst
Unit in use      .M
Delay Slots 1
4.31 BNOP
Branch Using a Displacement With NOP
Opcode
31 29 28 27 16
creg z src2
3 1 12
15 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 0 0 0 0 1 0 0 1 0 0 0 s p
3 1 1
Description The constant displacement form of the BNOP instruction performs a relative branch
with NOP instructions. The instruction performs the relative branch using the 12-bit
signed constant specified by src2. The constant is shifted 2 bits to the left, then added to
the address of the first instruction of the fetch packet that contains the BNOP
instruction (PCE1). The result is placed in the program fetch counter (PFC).
The 3-bit unsigned constant specified in src1 gives the number of delay slot NOP
instructions to be inserted, from 0 to 7. With src1 = 0, no NOP cycles are inserted.
This instruction helps reduce the number of instructions to perform a branch when
NOP instructions are required to fill the delay slots of a branch.
B .S1 LABEL
NOP N
LABEL: ADD
Note—
1) BNOP instructions may be predicated. The predication condition controls
whether or not the branch is taken, but does not affect the insertion of NOPs.
BNOP always inserts the number of NOPs specified by N, regardless of the
predication condition.
2) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
3) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
4) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
Only one branch instruction can be executed per cycle. If two branches are in the same
execute packet, and if both are taken, the behavior is undefined. It should also be noted
that when a predicated BNOP instruction is used with a NOP count greater than 5, the
CPU inserts the full delay slots requested when the predicated condition is false.
For example, the following set of instructions will insert 7 cycles of NOPs:
        ZERO    .L1     A0
[A0]    BNOP    .S1     LABEL,7 ; branch is not taken and
                                ; 7 cycles of NOPs are inserted
Conversely, when a predicated BNOP instruction is used with a NOP count greater
than 5 and the predication condition is true, the multicycle NOP is terminated
as soon as the branch is taken.
For example, in the following set of instructions, only 5 cycles of NOP are inserted:
        MVK     .D1     1,A0
[A0]    BNOP    .S1     LABEL,7 ; branch is taken and
                                ; 5 cycles of NOPs are inserted
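The two cases above reduce to a small rule, sketched here in C. The function names are hypothetical, and the model covers only the number of NOP cycles, not any other pipeline effect.

```c
#include <assert.h>

/* Hypothetical helper (not a TI API): NOP cycles actually inserted by BNOP.
   branch_taken is nonzero when the predicate is true; n is the NOP count. */
unsigned bnop_nop_cycles(int branch_taken, unsigned n)
{
    assert(n <= 7);  /* 3-bit unsigned constant */
    /* A taken branch ends the multicycle NOP after the 5 delay slots. */
    if (branch_taken && n > 5)
        return 5;
    return n;
}

/* Usage sketch mirroring the two listings above. */
void bnop_demo(void)
{
    assert(bnop_nop_cycles(0, 7) == 7);  /* [A0] false: full 7 cycles */
    assert(bnop_nop_cycles(1, 7) == 5);  /* [A0] true: cut off at 5 */
}
```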
The BNOP instruction cannot be paired with any other multicycle NOP instruction in
the same execute packet. Instructions that generate a multicycle NOP are: IDLE,
ADDKPC, CALLP, and the multicycle NOP.
The BNOP instruction does not require the use of the .S unit. If no unit is specified,
then it may be scheduled in parallel with instructions executing on both the .S1 and .S2
units. If either the .S1 or .S2 unit is specified for BNOP, then the .S unit specified is not
available for another instruction in the same execute packet. This is enforced by the
assembler.
It is possible to branch into the middle of a 32-bit instruction. The only case that will
be detected and result in an exception is when the 32-bit instruction is contained in a
compact header-based fetch packet. The header cannot be the target of a branch
instruction. In the event that the header is the target of a branch, an exception will be
raised.
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read src2
Written PC
Branch taken ✓
Unit in use .S
Delay Slots 5
4.32 BNOP
Branch Using a Register With NOP
unit = .S2
Opcode
Bits   31-29  28  27-23      22-18  17-16  15-13  12  11-2                 1  0
Field  creg   z   0 0 0 0 1  src2   0 0    src1   x   0 0 1 1 0 1 1 0 0 0  1  p
Width  3      1   5          5      2      3      1   10                   1  1
Description The register form of the BNOP instruction performs an absolute branch with NOP
instructions. The register specified in src2 is placed in the program fetch counter (PFC).
The 3-bit unsigned constant specified in src1 gives the number of delay-slot NOP
instructions to be inserted, from 0 to 7. With src1 = 0, no NOP cycles are inserted.
This instruction helps reduce the number of instructions needed to perform a branch when
NOP instructions are required to fill the delay slots of a branch. For example, the
following code:
B .S2 B3
NOP N
can be replaced by:
BNOP .S2 B3,N
Note—
1) BNOP instructions may be predicated. The predication condition controls
whether or not the branch is taken, but does not affect the insertion of NOPs.
BNOP always inserts the number of NOPs specified by N, regardless of the
predication condition.
2) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
3) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
4) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
Only one branch instruction can be executed per cycle. If two branches are in the same
execute packet, and if both are taken, the behavior is undefined. It should also be noted
that when a predicated BNOP instruction is used with a NOP count greater than 5, the
CPU inserts the full delay slots requested when the predicated condition is false.
For example, the following set of instructions will insert 7 cycles of NOPs:
ZERO .L1 A0
[A0] BNOP .S2 B3,7 ; branch is not taken and 7 cycles of NOPs are inserted
Conversely, when a predicated BNOP instruction is used with a NOP count greater
than 5 and the predication condition is true, the branch is taken and the multicycle
NOP is terminated when the branch is taken.
For example, in the following set of instructions, only 5 cycles of NOPs are inserted:
MVK .D1 1,A0
[A0] BNOP .S2 B3,7 ; branch is taken and 5 cycles of NOPs are inserted
The BNOP instruction cannot be paired with any other multicycle NOP instruction in
the same execute packet. Instructions that generate a multicycle NOP are: IDLE,
ADDKPC, CALLP, and the multicycle NOP.
Execution if (cond){
src2 → PFC;
nop (src1)
}
else nop (src1 + 1)
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read src2
Written PC
Branch taken ✓
Unit in use .S2
Delay Slots 5
4.33 BPOS
Branch Positive
Opcode
31 29 28 27 23 22
3 1 5 10
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src 0 0 0 0 0 0 0 1 0 0 0 s p
10 1 1
Description If the predication register (dst) is positive (greater than or equal to 0), the BPOS
instruction performs a relative branch. If dst is negative, the BPOS instruction takes no
other action.
The instruction performs the relative branch using a 10-bit signed constant, scst10, in
src. The constant is shifted 2 bits to the left, then added to the address of the first
instruction of the fetch packet that contains the BPOS instruction (PCE1). The result
is placed in the program fetch counter (PFC).
Any register can be used as dst, which frees the predicate registers (A0-A2 and B0-B2) for
other uses.
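The branch decision and target calculation described above can be sketched as follows. This is a minimal Python model, not the hardware definition: the function names bpos_taken and bpos_target are hypothetical, and the sketch assumes 32-bit wraparound addresses and that scst10 is passed as the raw 10-bit field.

```python
def bpos_taken(dst: int) -> bool:
    """Return True when the 32-bit register value is positive (>= 0)."""
    # A signed 32-bit value is non-negative when bit 31 is clear.
    return (dst & 0x80000000) == 0


def bpos_target(pce1: int, scst10: int) -> int:
    """Compute the branch target from PCE1 and the raw 10-bit constant."""
    field = scst10 & 0x3FF
    # Sign-extend the 10-bit field.
    offset = field - 0x400 if field & 0x200 else field
    # Shift left by 2 and add to the fetch packet address (PCE1).
    return (pce1 + (offset << 2)) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(bpos_target(0x1000, 0x001)))  # one word forward
    print(hex(bpos_target(0x1000, 0x3FF)))  # one word backward
```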
Note—
1) Only one BPOS instruction can be executed per cycle. The BPOS instruction
can be predicated by using any conventional condition register. The conditions
are effectively ANDed together. If two branches are in the same execute packet,
and if both are taken, behavior is undefined.
2) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
3) See ‘‘Branching Into the Middle of an Execute Packet’’ on page 3-13 for
information on branching into the middle of an execute packet.
4) A branch to an execute packet that spans two fetch packets will cause a stall
while the second fetch packet is fetched.
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read dst
Written PC
Branch taken ✓
Unit in use .S
Delay Slots 5
4.34 CALLP
Call Using a Displacement
Opcode
31 30 29 28 27
0 0 0 1 cst21
21
7 6 5 4 3 2 1 0
cst21 0 0 1 0 0 s p
21 1 1
Description A 21-bit signed constant, cst21, is shifted left by 2 bits and is added to the address of the
first instruction of the fetch packet that contains the branch instruction. The result is
placed in the program fetch counter (PFC). The assembler/linker automatically
computes the correct value for cst21 by the following formula:
cst21 = (label - PCE1) >> 2
The address of the execute packet immediately following the execute packet containing
the CALLP instruction is placed in A3, if the .S1 unit is used; or in B3, if the .S2 unit is
used. This write occurs in E1. An implied NOP 5 is inserted into the instruction
pipeline, occupying E2-E6.
Since this branch is taken unconditionally, it cannot be placed in the same execute
packet as another branch. Additionally, no other branches should be pending when the
CALLP instruction is executed.
CALLP, like other relative branch instructions, cannot have an ADDKPC instruction
in the same execute packet with it.
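The displacement arithmetic described above can be checked with a short model. The Python sketch below is an illustration under stated assumptions: the helper names callp_cst21 and callp_target are hypothetical, the label is assumed to be word aligned, and addresses are assumed to wrap at 32 bits.

```python
def callp_cst21(label: int, pce1: int) -> int:
    """Return the 21-bit field that encodes a branch from PCE1 to label."""
    delta = label - pce1
    if delta % 4 != 0:
        raise ValueError("target must be word aligned relative to PCE1")
    cst21 = delta >> 2
    if not -(1 << 20) <= cst21 < (1 << 20):
        raise ValueError("displacement does not fit in a signed 21-bit field")
    return cst21 & 0x1FFFFF


def callp_target(cst21_field: int, pce1: int) -> int:
    """Recover the branch target: (cst21 << 2) + PCE1."""
    field = cst21_field & 0x1FFFFF
    # Sign-extend the 21-bit field before shifting.
    value = field - 0x200000 if field & 0x100000 else field
    return (pce1 + (value << 2)) & 0xFFFFFFFF


if __name__ == "__main__":
    field = callp_cst21(0x2000, 0x1000)
    print(hex(field), hex(callp_target(field, 0x1000)))
```

Encoding and then decoding a displacement should return the original label, which is a convenient sanity check when inspecting assembler output by hand.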
Note—
1) PCE1 (program counter) represents the address of the first instruction in the
fetch packet in the E1 stage of the pipeline. PFC is the program fetch counter.
retPC represents the address of the first instruction of the execute packet in the
DC stage of the pipeline.
2) The execute packets in the delay slots of a branch cannot be interrupted. This
is true regardless of whether the branch is taken.
Execution (cst21 << 2) + PCE1 → PFC
if (unit = S2), retPC → B3
else if (unit = S1), retPC → A3
nop 5
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read
Written A3/B3
Branch taken ✓
Unit in use .S
Delay Slots 5
4.35 CCMATMPY
Complex Conjugate Matrix Multiply, Signed Complex 16-bit (16-bit real/16-bit
Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
5 5 5 5
Description This instruction multiplies the conjugate of a 1x2 complex vector by a 2x2
complex matrix, giving two 64-bit complex results. The input format is a 32-bit complex
number, i.e., two 16-bit numbers packed together. The high 16 bits are the real part, and
the low 16 bits are the imaginary part.
Each output of the instruction is a 64-bit complex result having 32 bits for the real part and
32 bits for the imaginary part. The real part goes in the upper 32 bits of the register
pair, and the imaginary part goes in the lower 32 bits.
The main difference between executing CCMATMPY and the above sequence is that
saturation is only performed once at the end and intermediate precision is kept at 34
bits.
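The arithmetic described above can be modeled in a few lines. The following Python sketch is an assumption-laden illustration, not the hardware definition: the function name ccmatmpy and its argument order are hypothetical, the pairing of matrix registers with result pairs is inferred from the register examples in this section, and Python integers stand in for the 34-bit intermediate precision.

```python
def s16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def unpack(word: int):
    """Split a packed complex word into (real, imag); real is the high half."""
    return s16(word >> 16), s16(word)


def sat32(x: int) -> int:
    """Saturate to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, x))


def conj_mul(a: int, b: int):
    """Return conj(a) * b as an unsaturated (real, imag) pair."""
    ar, ai = unpack(a)
    br, bi = unpack(b)
    return ar * br + ai * bi, ar * bi - ai * br


def ccmatmpy(v_hi, v_lo, m3, m2, m1, m0):
    """Model dst_3:dst_2:dst_1:dst_0 for src1 = v_hi:v_lo, src2 = m3:m2:m1:m0."""
    out = []
    for top, bottom in ((m3, m1), (m2, m0)):
        r1, i1 = conj_mul(v_hi, top)
        r2, i2 = conj_mul(v_lo, bottom)
        # Saturation is applied once, after the two products are summed.
        out.append(sat32(r1 + r2) & 0xFFFFFFFF)  # real part, odd register
        out.append(sat32(i1 + i2) & 0xFFFFFFFF)  # imaginary part, even register
    return tuple(out)


if __name__ == "__main__":
    regs = ccmatmpy(0x7FFF7FFF, 0x7FFF8000,
                    0x7FFF7FFF, 0x7FFF8000, 0x7FFF8000, 0x7FFF7FFF)
    print([hex(r) for r in regs])
```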
Execution ((msb16(src1_e) x lsb16(src2_0))-(lsb16(src1_e) x msb16(src2_0)))-> tmp0_e
((msb16(src1_e) x msb16(src2_0))+(lsb16(src1_e) x lsb16(src2_0)))-> tmp0_o
((msb16(src1_o) x lsb16(src2_2))-(lsb16(src1_o) x msb16(src2_2)))-> tmp1_e
((msb16(src1_o) x msb16(src2_2))+(lsb16(src1_o) x lsb16(src2_2)))-> tmp1_o
((msb16(src1_e) x lsb16(src2_1))-(lsb16(src1_e) x msb16(src2_1)))-> tmp2_e
Delay Slots 3
CSR= 0x10010000
CSR == 0x10010000 ; 4
A7 == 0x80008000
A6 == 0x80008000
A3 == 0x80008000
A2 == 0x80008000
A1 == 0x80008000
A0 == 0x80008000
CCMATMPY .M A7:A6,A3:A2:A1:A0,A11:A10:A9:A8
A11 == 0x7fffffff
A10 == 0x00000000
A9 == 0x7fffffff
A8 == 0x00000000
CSR= 0x10010200
CSR == 0x10010000 ; 4
B7 == 0x7FFF7FFF
B6 == 0x7FFF8000
B3 == 0x7FFF7FFF
B2 == 0x7FFF8000
B1 == 0x7FFF8000
B0 == 0x7FFF7FFF
CCMATMPY .M B7:B6,B3:B2:B1:B0,B11:B10:B9:B8
B11 == 0x7fffffff
B10 == 0x00000000
B9 == 0xffff0002
B8 == 0x00000000
CSR= 0x10010200
CSR == 0x10010000 ; 4
B7 == 0xFFFFFFFF
B6 == 0xFFFFFFFF
B3 == 0xFFFFFFFF
B2 == 0xFFFFFFFF
B1 == 0xFFFFFFFF
B0 == 0xFFFFFFFF
CCMATMPY .M B7:B6,B3:B2:B1:B0,B11:B10:B9:B8
B11 == 0x00000004
B10 == 0x00000000
B9 == 0x00000004
B8 == 0x00000000
CSR == 0x10010000
CSR == 0x10010000 ; 4
4.36 CCMATMPYR1
Complex Conjugate Matrix Multiply With Rounding, Signed Complex 16-bit (16-bit
Real/16-bit Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
5 5 5 5
Description This instruction multiplies the conjugate of a 1x2 complex vector by a 2x2
complex matrix, with rounding, giving two 32-bit complex results. For example,
CCMATMPYR1 A7:A6,A3:A2:A1:A0,A9:A8 performs the following:
[A9 A8] = conj([A7 A6]) * | A3 A2 |
                          | A1 A0 |
-or-
A9 = conj(A7)*A3 + conj(A6)*A1
A8 = conj(A7)*A2 + conj(A6)*A0
DCONJ A7,A6,A5:A4
CMPYR1 A3,A5,A31
CMPYR1 A1,A4,A29
CMPYR1 A2,A5,A30
CMPYR1 A0,A4,A28
NOP
DSADD A31:A30,A29:A28,A9:A8
The difference between executing CCMATMPYR1 and the above sequence is that
saturation is only performed once at the end and intermediate precision is kept at 34
bits.
Delay Slots 3
CSR= 0x10010000
CSR == 0x10010000 ; 4
A7 == 0x80008000
A6 == 0x80008000
A3 == 0x80008000
A2 == 0x80008000
A1 == 0x80008000
A0 == 0x80008000
CCMATMPYR1 .M A7:A6,A3:A2:A1:A0,A11:A10
A11 == 0x7fff0000
A10 == 0x7fff0000
CSR= 0x10010200
CSR == 0x10010000 ; 4
B7 == 0x7FFF7FFF
B6 == 0x7FFF8000
B3 == 0x7FFF7FFF
B2 == 0x7FFF8000
B1 == 0x7FFF8000
B0 == 0x7FFF7FFF
CCMATMPYR1 .M B7:B6,B3:B2:B1:B0,B11:B10
B11 == 0x7fff0000
B10 == 0xfffe0000
CSR == 0x10010200
CSR == 0x10010000 ; 4
B7 == 0xFFFFFFFF
B6 == 0xFFFFFFFF
B3 == 0xFFFFFFFF
B2 == 0xFFFFFFFF
B1 == 0xFFFFFFFF
B0 == 0xFFFFFFFF
CCMATMPYR1 .M B7:B6,B3:B2:B1:B0,B11:B10
B11 == 0x00000000
B10 == 0x00000000
CSR == 0x10010000
CSR == 0x10010000 ; 4
4.37 CCMPY32R1
Complex Multiply With Rounding and Conjugate, Signed Complex 32-bit (32-bit
Real/32-bit Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description The CCMPY32R1 instruction performs one complex multiply between the 64-bit
complex number in src1 and the complex conjugate of the 64-bit complex number in
src2. Each 64-bit complex number is contained in a register pair. The odd register in
the pair (the most significant word) represents the real component of the complex
number as a 32-bit signed quantity. The even register in the pair represents the
imaginary component of the complex number. The saturation condition of
0x80000000 * 0x80000000 + 0x80000000 * 0x80000000 is taken into account, yielding a
result of 7FFFFFFF:00000000.
After multiplying and adding the 32-bit numbers together, they are shifted right by 31
and rounded. Intermediate results are calculated at 64-bits.
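A compact model of this operation may help when checking the register examples that follow. The Python sketch below is not taken from the manual: the function name ccmpy32r1 is hypothetical, and the rounding step (add 2^30, then arithmetic shift right by 31) is an assumption chosen because it reproduces the example results in this section.

```python
def s32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def round_shift_sat(x: int) -> int:
    """Round, shift right by 31, and saturate to signed 32 bits."""
    y = (x + (1 << 30)) >> 31  # assumed rounding constant
    return max(-(1 << 31), min((1 << 31) - 1, y))


def ccmpy32r1(src1_hi, src1_lo, src2_hi, src2_lo):
    """Model src1 * conj(src2); returns (dst_hi, dst_lo) = (real, imag)."""
    ar, ai = s32(src1_hi), s32(src1_lo)
    br, bi = s32(src2_hi), s32(src2_lo)
    real = ar * br + ai * bi   # real part of a * conj(b)
    imag = ai * br - ar * bi   # imaginary part of a * conj(b)
    return (round_shift_sat(real) & 0xFFFFFFFF,
            round_shift_sat(imag) & 0xFFFFFFFF)


if __name__ == "__main__":
    hi, lo = ccmpy32r1(0x00000000, 0x08000400, 0x00000000, 0x09000200)
    print(hex(hi), hex(lo))
```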
dwdst.high = tmp_real ;
dwdst.low = tmp_im ;
Delay Slots 3
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x00000000
A2 == 0x08000400
A1 == 0x00000000
A0 == 0x09000200
CCMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00900068
A14 == 0x00000000
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x80000000
A2 == 0x80000000
A1 == 0x80000000
A0 == 0x80000000
CCMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x7FFFFFFF
A14 == 0x00000000
CSR= 0x00000200
CSR == 0x00000000 ; 4
A3 == 0x7FFF7FFF
A2 == 0x7FFF8000
A1 == 0x7FFF8000
A0 == 0x7FFF7FFF
CCMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x7fffffff
A14 == 0x00000002
CSR= 0x00000200
CSR == 0x00000000 ; 4
A3 == 0xFFFFFFFF
A2 == 0xFFFFFFFF
A1 == 0xFFFFFFFF
A0 == 0xFFFFFFFF
CCMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x00000000
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x55555555
A2 == 0x55555555
A1 == 0x55555555
A0 == 0x55555555
CCMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x71c71c71
A14 == 0x00000000
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x01234567
A2 == 0x89ABCDEF
A1 == 0x89ABCDEF
A0 == 0x01234567
CCMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0xfde578db
A14 == 0x6D60DCE3
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x80000000
A2 == 0x7fff7fff
A1 == 0x7fff7fff
A0 == 0x7fffffff
CCMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0xffffffff
A14 == 0x7fffffff
CSR= 0x00000200 ; 4
CSR == 0x00000000 ; 4
4.38 CLR
Clear a Bit Field
or
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 8 7 6 5 4 3 2 1 0
csta cstb 1 1 0 0 1 0 s p
5 5 1 1
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 1 1 1 1 0 0 0 s p
5 1 1 1
Description For cstb > csta, the field in src2 as specified by csta to cstb is cleared to all 0s in dst. The
csta and cstb operands may be specified as constants or in the 10 LSBs of the src1
register, with cstb being bits 0−4 (src1[4..0]) and csta being bits 5−9 (src1[9..5]). csta is the LSB
of the field and cstb is the MSB of the field. In other words, csta and cstb represent the
beginning and ending bits, respectively, of the field to be cleared to all 0s in dst. The LSB
location of src2 is bit 0 and the MSB location of src2 is bit 31.
In the following example, csta is 15 and cstb is 23. For the register version of the
instruction, only the 10 LSBs of the src1 register are valid. If any of the 22 MSBs are
non-zero, the result is invalid.
cstb
csta
src2 X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst X X X X X X X X 0 0 0 0 0 0 0 0 0 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
For cstb < csta, the src2 register is copied to dst. The csta and cstb operands may be
specified as constants or in the 10 LSBs of the src1 register, with cstb being bits 0−4
(src1[4..0]) and csta being bits 5−9 (src1[9..5]).
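The field-clearing rule can be expressed as a mask operation. The Python sketch below is a minimal model under stated assumptions: the function names clr and clr_reg are hypothetical, and the case cstb == csta, which the text does not spell out, is assumed here to clear a single bit.

```python
def clr(src2: int, csta: int, cstb: int) -> int:
    """Clear bits csta..cstb (inclusive) of a 32-bit value."""
    if not (0 <= csta <= 31 and 0 <= cstb <= 31):
        raise ValueError("csta and cstb must be in the range 0..31")
    src2 &= 0xFFFFFFFF
    if cstb < csta:
        # Per the description, src2 is copied to dst unchanged.
        return src2
    width = cstb - csta + 1
    mask = ((1 << width) - 1) << csta
    return src2 & ~mask & 0xFFFFFFFF


def clr_reg(src2: int, src1: int) -> int:
    """Register form: cstb is src1[4..0], csta is src1[9..5]."""
    if (src1 & 0xFFFFFFFF) >> 10:
        # The 22 MSBs must be zero, otherwise the result is invalid.
        raise ValueError("only the 10 LSBs of src1 may be nonzero")
    return clr(src2, (src1 >> 5) & 0x1F, src1 & 0x1F)


if __name__ == "__main__":
    print(hex(clr(0xFFFFFFFF, 15, 23)))  # field from the description
```

Clearing bits 4 through 19 of 07A43F2Ah with this model gives 07A0000Ah; those field bounds are an assumption chosen for illustration.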
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A1 07A43F2Ah A1 07A43F2Ah
A2 xxxxxxxxh A2 07A0000Ah
Example 2
B1 03B6E7D5h B1 03B6E7D5h
4.39 CMATMPY
Complex Matrix Multiply, Signed Complex 16-bit (16-bit real/16-bit Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
5 5 5 5
Description This instruction multiplies a 1x2 complex vector by a 2x2 complex matrix,
giving two 64-bit complex results. The input format is a 32-bit complex number, i.e., two
16-bit numbers packed together. The high 16 bits are the real part, and the low 16 bits
are the imaginary part. Each output of the instruction is a 64-bit complex result having
32 bits for the real part and 32 bits for the imaginary part. The real part goes in the upper
32 bits of the register pair, and the imaginary part goes in the lower 32 bits.
CMPY A3,A7,A31:A30
CMPY A1,A6,A29:A28
CMPY A2,A7,A27:A26
CMPY A0,A6,A25:A24
NOP
DSADD A31:A30,A29:A28,A11:A10
DSADD A27:A26,A25:A24,A9:A8
The difference between executing CMATMPY and the above sequence is that
saturation is only performed once at the end and intermediate precision is kept at 34
bits.
sat(tmp0_e + tmp1_e)->dst_0
sat(tmp0_o + tmp1_o)->dst_1
sat(tmp2_e + tmp3_e)->dst_2
sat(tmp2_o + tmp3_o)->dst_3
Delay Slots 3
CSR == 0x10010000 ; 4
A7 == 0x80008000
A6 == 0x80008000
A3 == 0x80008000
A2 == 0x80008000
A1 == 0x80008000
A0 == 0x80008000
CMATMPY .M A7:A6,A3:A2:A1:A0,A11:A10:A9:A8
A11 == 0x00000000
A10 == 0x7fffffff
A9 == 0x00000000
A8 == 0x7fffffff
CSR= 0x10010200
CSR == 0x10010000 ; 4
B7 == 0x7FFF7FFF
B6 == 0x7FFF8000
B3 == 0x7FFF7FFF
B2 == 0x7FFF8000
B1 == 0x7FFF8000
B0 == 0x7FFF7FFF
CMATMPY .M B7:B6,B3:B2:B1:B0,B11:B10:B9:B8
B11 == 0xffff0001
B10 == 0xffff0002
B9 == 0x7fffffff
B8 == 0xffff0002
CSR= 0x10010200
CSR == 0x10010000 ; 4
B7 == 0x01234567
B6 == 0x89ABCDEF
B3 == 0x01234567
B2 == 0x89ABCDEF
B1 == 0x89ABCDEF
B0 == 0x01234567
CMATMPY .M B7:B6,B3:B2:B1:B0,B11:B10:B9:B8
B11 == 0x1a186e70
B10 == 0x2ee6b374
B9 == 0x1a186e70
B8 == 0xbf6522f4
CSR= 0x10010000
CSR == 0x10010000 ; 4
4.40 CMATMPYR1
Complex Matrix Multiply With Rounding, Signed Complex 16-bit (16-bit Real/16-bit
Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
5 5 5 5
Description This instruction multiplies a 1x2 complex vector by a 2x2 complex matrix,
with rounding, giving two 32-bit complex results. For example, CMATMPYR1
A7:A6,A3:A2:A1:A0,A9:A8 performs the following:
[A9 A8] = [A7 A6] * | A3 A2 |
                    | A1 A0 |
-or-
A9 = A7*A3 + A6*A1
A8 = A7*A2 + A6*A0
CMPYR1 A3,A7,A31
CMPYR1 A1,A6,A29
CMPYR1 A2,A7,A30
CMPYR1 A0,A6,A28
NOP
DSADD A31:A30,A29:A28,A9:A8
The difference between executing CMATMPYR1 and the above sequence is that
saturation is only performed once at the end and intermediate precision is kept at 34
bits.
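The single-saturation behavior with rounding can be sketched as below. This Python model is illustrative only: the function name cmatmpyr1 is hypothetical, and the rounding rule (add 0x4000, shift right by 15, saturate to 16 bits) is an assumption chosen because it reproduces the register examples in this section.

```python
def s16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def cmul(a: int, b: int):
    """Return a * b for packed complex words as an unsaturated (real, imag)."""
    ar, ai = s16(a >> 16), s16(a)
    br, bi = s16(b >> 16), s16(b)
    return ar * br - ai * bi, ar * bi + ai * br


def round_sat16(x: int) -> int:
    """Round and scale a summed product, then saturate to signed 16 bits."""
    y = (x + 0x4000) >> 15  # assumed rounding constant and shift
    return max(-0x8000, min(0x7FFF, y))


def cmatmpyr1(v_hi, v_lo, m3, m2, m1, m0):
    """Model dst_hi:dst_lo for src1 = v_hi:v_lo and src2 = m3:m2:m1:m0."""
    words = []
    for top, bottom in ((m3, m1), (m2, m0)):
        r1, i1 = cmul(v_hi, top)
        r2, i2 = cmul(v_lo, bottom)
        # Products are summed at full precision; saturation happens once.
        re = round_sat16(r1 + r2)
        im = round_sat16(i1 + i2)
        words.append(((re & 0xFFFF) << 16) | (im & 0xFFFF))
    return tuple(words)


if __name__ == "__main__":
    hi, lo = cmatmpyr1(0x7FFF7FFF, 0x7FFF8000,
                       0x7FFF7FFF, 0x7FFF8000, 0x7FFF8000, 0x7FFF7FFF)
    print(hex(hi), hex(lo))
```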
Delay Slots 3
CSR= 0x10010000
CSR == 0x10010000 ; 4
A7 == 0x80008000
A6 == 0x80008000
A3 == 0x80008000
A2 == 0x80008000
A1 == 0x80008000
A0 == 0x80008000
CMATMPYR1 .M A7:A6,A3:A2:A1:A0,A11:A10
A11 == 0x00007fff
A10 == 0x00007fff
CSR= 0x10010200
CSR == 0x10010000 ; 4
B7 == 0x7FFF7FFF
B6 == 0x7FFF8000
B3 == 0x7FFF7FFF
B2 == 0x7FFF8000
B1 == 0x7FFF8000
B0 == 0x7FFF7FFF
CMATMPYR1 .M B7:B6,B3:B2:B1:B0,B11:B10
B11 == 0xfffefffe
B10 == 0x7ffffffe
CSR= 0x10010200
CSR == 0x10010000 ; 4
B7 == 0xFFFFFFFF
B6 == 0xFFFFFFFF
B3 == 0xFFFFFFFF
B2 == 0xFFFFFFFF
B1 == 0xFFFFFFFF
B0 == 0xFFFFFFFF
CMATMPYR1 .M B7:B6,B3:B2:B1:B0,B11:B10
B11 == 0x00000000
B10 == 0x00000000
CSR= 0x10010000
CSR == 0x10010000 ; 4
4.41 CMPEQ
Compare for Equality, Signed Integer
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Description Compares src1 to src2. If src1 equals src2, then 1 is written to dst; otherwise, 0 is written
to dst.
Execution if (cond){
if (src1 == src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
A1 0000000Ch 12 A1 0000000Ch
Example 3
A1 F23A3789h A1 F23A3789h
4.42 CMPEQ2
Compare for Equality, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 1 0 1 1 0 0 0 s p
5 1 1 1
Description Performs equality comparisons on packed 16-bit data. Each 16-bit value in src1 is
compared against the corresponding 16-bit value in src2, returning either a 1 if equal
or a 0 if not equal. The equality results are packed into the two least-significant bits of
dst. The result for the lower pair of values is placed in bit 0, and the result for the upper
pair of values is placed in bit 1. The remaining bits of dst are cleared to 0.
         31      16 15       0
src1       a_hi       a_lo
src2       b_hi       b_lo

dst bit 1     = (a_hi == b_hi)
dst bit 0     = (a_lo == b_lo)
dst bits 31-2 = 0
Execution if (cond){
if (lsb16(src1) == lsb16(src2)), 1 → dst[0] else 0 → dst[0];
if (msb16(src1) == msb16(src2)), 1 → dst[1] else 0 → dst[1]
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPEQ2 .S1 A3,A4,A5
Example 2
Example 3
4.43 CMPEQ4
Compare for Equality, Packed 8-Bit
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 1 0 0 1 0 0 0 s p
5 1 1 1
Description Performs equality comparisons on packed 8-bit data. Each 8-bit value in src1 is
compared against the corresponding 8-bit value in src2, returning either a 1 if equal or
a 0 if not equal. The equality comparison results are packed into the four
least-significant bits of dst.
The 8-bit values in each input are numbered from 0 to 3, starting with the
least-significant byte, then working towards the most-significant byte. The comparison
result for byte 0 is written to bit 0 of the result. Likewise, the results for bytes 1 to 3 are
written to bits 1 to 3 of the result, respectively, as shown in the diagram below. The
remaining bits of dst are cleared to 0.
         31    24 23    16 15     8 7      0
src1      sa_3     sa_2     sa_1     sa_0
src2      sb_3     sb_2     sb_1     sb_0

dst bit 3     = (sa_3 == sb_3)
dst bit 2     = (sa_2 == sb_2)
dst bit 1     = (sa_1 == sb_1)
dst bit 0     = (sa_0 == sb_0)
dst bits 31-4 = 0
Execution if (cond){
if (sbyte0(src1) == sbyte0(src2)), 1 → dst[0] else 0 → dst[0];
if (sbyte1(src1) == sbyte1(src2)), 1 → dst[1] else 0 → dst[1];
if (sbyte2(src1) == sbyte2(src2)), 1 → dst[2] else 0 → dst[2];
if (sbyte3(src1) == sbyte3(src2)), 1 → dst[3] else 0 → dst[3]
}
else nop
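The packed comparison can be modeled lane by lane. The Python sketch below is an illustration, not the hardware definition: the helper cmpeq_packed and its wrappers cmpeq4 and cmpeq2 are hypothetical names, and the predicate (cond) is assumed to be true.

```python
def cmpeq_packed(src1: int, src2: int, lane_bits: int) -> int:
    """Compare equal-width lanes of two 32-bit words; bit n reports lane n."""
    lanes = 32 // lane_bits
    mask = (1 << lane_bits) - 1
    result = 0
    for lane in range(lanes):
        shift = lane * lane_bits
        a = (src1 >> shift) & mask
        b = (src2 >> shift) & mask
        if a == b:
            result |= 1 << lane
    # All bits above the lane results stay 0.
    return result


def cmpeq4(src1: int, src2: int) -> int:
    """Four 8-bit lanes; results in dst bits 3..0."""
    return cmpeq_packed(src1, src2, 8)


def cmpeq2(src1: int, src2: int) -> int:
    """Two 16-bit lanes; results in dst bits 1..0."""
    return cmpeq_packed(src1, src2, 16)


if __name__ == "__main__":
    print(hex(cmpeq4(0xF23A3789, 0x04B83789)))
    print(hex(cmpeq4(0x01B62451, 0x05B62451)))
```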
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
See Also CMPEQ, CMPEQ2, CMPGTU4, XPND4
Examples Example 1
A3 02 3A 4E 1Ch A3 02 3A 4E 1Ch
A4 02 B8 4E 76h A4 02 B8 4E 76h
Example 2
B2 F2 3A 37 89h B2 F2 3A 37 89h
B8 04 B8 37 89h B8 04 B8 37 89h
B13 xxxx xxxxh B13 0000 0003h false, false, true, true
Example 3
B2 01 B6 24 51h B2 01 B6 24 51h
B8 05 B6 24 51h B8 05 B6 24 51h
B13 xxxx xxxxh B13 0000 0007h false, true, true, true
4.44 CMPEQDP
Compare for Equality, Double-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 0 0 0 1 0 0 0 s p
5 1 1 1
Description Compares src1 to src2. If src1 equals src2, then 1 is written to dst; otherwise, 0 is written
to dst.
Note—
1) In the case of NaN compared with itself, the result is false.
2) No configuration bits other than those in the preceding table are set, except the
NaNn and DENn bits when appropriate.
Execution if (cond){
if (src1 == src2), 1 → dst
else 0 → dst
}
else nop
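The comparison, including the NaN case from the note, can be illustrated with IEEE 754 doubles in Python. This is a sketch under stated assumptions: the helper names pair_to_double and cmpeqdp are hypothetical, each operand is given as a (high word, low word) register pair, and status-bit side effects are not modeled.

```python
import struct


def pair_to_double(hi: int, lo: int) -> float:
    """Reinterpret a register pair (hi:lo) as a double-precision value."""
    raw = struct.pack(">II", hi & 0xFFFFFFFF, lo & 0xFFFFFFFF)
    return struct.unpack(">d", raw)[0]


def cmpeqdp(src1, src2) -> int:
    """Return 1 if the two doubles compare equal, otherwise 0."""
    a = pair_to_double(*src1)
    b = pair_to_double(*src2)
    # IEEE 754 equality: a NaN never compares equal, even to itself.
    return 1 if a == b else 0


if __name__ == "__main__":
    a = (0x40213333, 0x33333333)    # 8.6
    b = (0xC0040000, 0x00000000)    # -2.5
    nan = (0x7FF80000, 0x00000000)  # a quiet NaN bit pattern
    print(cmpeqdp(a, b), cmpeqdp(a, a), cmpeqdp(nan, nan))
```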
Pipeline
Pipeline Stage   E1               E2
Read             src1_l, src2_l   src1_h, src2_h
Written                           dst
Unit in use      .S               .S
Delay Slots 1
A1:A0 4021 3333h 3333 3333h A1:A0 4021 3333h 3333 3333h 8.6
A3:A2 C004 0000h 0000 0000h A3:A2 C004 0000h 0000 0000h -2.5
4.45 CMPEQSP
Compare for Equality, Single-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 0 0 1 0 0 0 s p
5 1 1 1
Description Compares src1 to src2. If src1 equals src2, then 1 is written to dst; otherwise, 0 is written
to dst.
Note—
1) In the case of NaN compared with itself, the result is false.
2) No configuration bits other than those in the preceding table are set, except the
NaNn and DENn bits when appropriate.
Execution if (cond){
if (src1 == src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
4.46 CMPGT
Compare for Greater Than, Signed Integers
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Description Performs a signed comparison of src1 to src2. If src1 is greater than src2, then a 1 is
written to dst; otherwise, a 0 is written to dst.
These two instructions are equivalent, with the second instruction using the
conventional operand types for src1 and src2.
In both of these operations the listing file (.lst) will have the first
implementation, and the second implementation will appear in the debugger.
Execution if (cond){
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
Example 3
Example 4
4.47 CMPGT2
Compare for Greater Than, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 0 1 0 0 0 s p
5 1 1 1
Description Performs greater-than comparisons on signed, packed 16-bit data. Each
signed 16-bit value in src1 is compared against the corresponding signed 16-bit value
in src2, returning a 1 if src1 is greater than src2 or returning a 0 if it is not greater. The
comparison results are packed into the two least-significant bits of dst. The result for
the lower pair of values is placed in bit 0, and the result for the upper pair of values is
placed in bit 1. The remaining bits of dst are cleared to 0.
31 16 15 0
a_hi a_lo ←src1
CMPGT2
↓↑ ↓↑
31 16 15 0
b_hi b_lo ←src2
Execution if (cond){
if (lsb16(src1) > lsb16(src2)), 1 → dst[0] else 0 → dst[0];
if (msb16(src1) > msb16(src2)), 1 → dst[1] else 0 → dst[1]
}
else nop
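Because the halfwords are compared as signed values, a halfword such as FFFFh counts as -1 rather than 65535. The Python sketch below illustrates this; the function name cmpgt2 is used here for a hypothetical model, and the predicate (cond) is assumed to be true.

```python
def s16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def cmpgt2(src1: int, src2: int) -> int:
    """Signed greater-than on both halfwords; results in dst bits 1..0."""
    lo = 1 if s16(src1) > s16(src2) else 0
    hi = 1 if s16(src1 >> 16) > s16(src2 >> 16) else 0
    # All other bits of the result stay 0.
    return (hi << 1) | lo


if __name__ == "__main__":
    # Upper lanes: 1 > -1 is true. Lower lanes: -1 > 0 is false.
    print(cmpgt2(0x0001FFFF, 0xFFFF0000))
```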
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPGT2 .S1 A3,A4,A5
Example 2
Example 3
4.48 CMPGTDP
Compare for Greater Than, Double-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 0 0 1 1 0 0 0 s p
5 1 1 1
Description Compares src1 to src2. If src1 is greater than src2, then 1 is written to dst; otherwise, 0
is written to dst.
Note—No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution if (cond){
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage   E1               E2
Read             src1_l, src2_l   src1_h, src2_h
Written                           dst
Unit in use      .S               .S
Delay Slots 1
Functional Unit Latency 2
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
A3:A2 C004 0000h 0000 0000h -2.5 A3:A2 C004 0000h 0000 0000h
4.49 CMPGTSP
Compare for Greater Than, Single-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 0 1 1 0 0 0 s p
5 1 1 1
Description Compares src1 to src2. If src1 is greater than src2, then 1 is written to dst; otherwise, 0
is written to dst.
Note—No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution if (cond){
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Functional Unit Latency 1
4.50 CMPGTU
Compare for Greater Than, Unsigned Integers
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-5  4-2    1  0
Field   creg   z   dst    src2   src1   x   op    1 1 0  s  p
Width   3      1   5      5      5      1   7     3      1  1
Description Performs an unsigned comparison of src1 to src2. If src1 is greater than src2, then a 1 is
written to dst; otherwise, a 0 is written to dst. Only the four LSBs are valid in the 5-bit
dst field when the ucst4 operand is used. If the MSB of the dst field is nonzero, the result
is invalid.
Execution if (cond){
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
Example 3
4.51 CMPGTU4
Compare for Greater Than, Unsigned, Packed 8-Bit
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-2                 1  0
Field   creg   z   dst    src2   src1   x   0 1 0 1 0 1 1 0 0 0  s  p
Width   3      1   5      5      5      1   10                   1  1
Description Performs greater-than comparisons on packed 8-bit data. Each unsigned
8-bit value in src1 is compared against the corresponding unsigned 8-bit value in src2,
returning a 1 if the byte in src1 is greater than the corresponding byte in src2 or a 0 if it
is not greater. The comparison results are packed into the four least-significant bits of dst.
The 8-bit values in each input are numbered from 0 to 3, starting with the
least-significant byte, then working towards the most-significant byte. The comparison
results for byte 0 are written to bit 0 of the result. Likewise, the results for byte 1 to 3 are
written to bits 1 to 3 of the result, respectively, as shown in the diagram below. The
remaining bits of dst are cleared to 0.
31    24 23    16 15     8 7      0
ua_3     ua_2     ua_1     ua_0     ←src1
                CMPGTU4
↓↑       ↓↑       ↓↑       ↓↑
31    24 23    16 15     8 7      0
ub_3     ub_2     ub_1     ub_0     ←src2
Execution if (cond){
if (ubyte0(src1) > ubyte0(src2)), 1 → dst0 else 0 → dst0;
if (ubyte1(src1) > ubyte1(src2)), 1 → dst1 else 0 → dst1;
if (ubyte2(src1) > ubyte2(src2)), 1 → dst2 else 0 → dst2;
if (ubyte3(src1) > ubyte3(src2)), 1 → dst3 else 0 → dst3
}
else nop
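The byte-wise comparison above can be modeled in a few lines of Python. This is only an illustrative sketch: the function name cmpgtu4 and the sample operand values are chosen here and are not taken from the guide, and the conditional (cond) part of the instruction is not modeled.

```python
def cmpgtu4(src1, src2):
    """Bit n of the result is 1 when unsigned byte n of src1 > byte n of src2."""
    dst = 0
    for n in range(4):
        byte1 = (src1 >> (8 * n)) & 0xFF
        byte2 = (src2 >> (8 * n)) & 0xFF
        if byte1 > byte2:
            dst |= 1 << n
    return dst  # bits 31..4 stay cleared

# Hypothetical operands: byte 3 and byte 0 of src1 are larger, bytes 2 and 1 are not.
mask = cmpgtu4(0x253A1CE4, 0x023A7F12)
```

For these operands only byte 3 and byte 0 compare greater, so the modeled result is 9h (binary 1001).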
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
Example 2
B13 xxxx xxxxh B13 0000 000Eh true, true, true, false
Example 3
B13 xxxx xxxxh B13 0000 0002h false, false, true, false
4.52 CMPLT
Compare for Less Than, Signed Integers
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-5  4-2    1  0
Field   creg   z   dst    src2   src1   x   op    1 1 0  s  p
Width   3      1   5      5      5      1   7     3      1  1
Description Performs a signed comparison of src1 to src2. If src1 is less than src2, then 1 is written
to dst; otherwise, 0 is written to dst.
These two instructions are equivalent, with the second instruction using the
conventional operand types for src1 and src2.
In both of these operations the listing file (.lst) will have the first
implementation, and the second implementation will appear in the debugger.
Execution if (cond){
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
CMPLT .L1 A1,A2,A3
Example 3
A1 00000005h 5 A1 00000005h
4.53 CMPLT2
Compare for Less Than, Packed 16-Bit
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-2                 1  0
Field   creg   z   dst    src2   src1   x   0 1 0 1 0 0 1 0 0 0  s  p
Width   3      1   5      5      5      1   10                   1  1
The assembler uses the operation CMPGT2 (.unit) src1, src2, dst to perform this task
(see CMPGT2).
Execution if (cond){
if (lsb16(src2) < lsb16(src1)), 1 → dst0 else 0 → dst0;
if (msb16(src2) < msb16(src1)), 1 → dst1 else 0 → dst1
}
else nop
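Because the less-than form is carried out by the greater-than operation with the operands exchanged, the relationship can be sketched in Python as follows. The function names and the parameter names a and b are hypothetical, the sample values are chosen here, and the halfwords are treated as signed, as the swapped-operand execution text suggests.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def cmpgt2(src1, src2):
    """Bit 0: lsb16(src1) > lsb16(src2); bit 1: msb16(src1) > msb16(src2)."""
    dst = 0
    if s16(src1) > s16(src2):
        dst |= 1
    if s16(src1 >> 16) > s16(src2 >> 16):
        dst |= 2
    return dst

def cmplt2(a, b):
    """a < b per halfword, expressed as the greater-than compare with operands swapped."""
    return cmpgt2(b, a)

# Hypothetical operands: upper halves -1 and 1, lower halves 5 and 3.
lt = cmplt2(0xFFFF0005, 0x00010003)
gt = cmpgt2(0xFFFF0005, 0x00010003)
```

Here only the upper halfword of the first operand is smaller, so lt has bit 1 set, while gt has bit 0 set.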
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Example 2
Example 3
4.54 CMPLTDP
Compare for Less Than, Double-Precision Floating-Point Values
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-2                 1  0
Field   creg   z   dst    src2   src1   x   1 0 1 0 1 0 1 0 0 0  s  p
Width   3      1   5      5      5      1   10                   1  1
Description Compares src1 to src2. If src1 is less than src2, then 1 is written to dst; otherwise, 0 is
written to dst.
Note—No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution if (cond){
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage   E1               E2
Read             src1_l, src2_l   src1_h, src2_h
Written                           dst
Unit in use      .S               .S
Delay Slots 1
Functional Unit Latency 2
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
B3:B2 C004 0000h 0000 0000h -2.5 B3:B2 C004 0000h 0000 0000h
4.55 CMPLTSP
Compare for Less Than, Single-Precision Floating-Point Values
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-2                 1  0
Field   creg   z   dst    src2   src1   x   1 1 1 0 1 0 1 0 0 0  s  p
Width   3      1   5      5      5      1   10                   1  1
Description Compares src1 to src2. If src1 is less than src2, then 1 is written to dst; otherwise, 0 is
written to dst.
Note—No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution if (cond){
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Functional Unit Latency 1
4.56 CMPLTU
Compare for Less Than, Unsigned Integers
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-5  4-2    1  0
Field   creg   z   dst    src2   src1   x   op    1 1 0  s  p
Width   3      1   5      5      5      1   7     3      1  1
Description Performs an unsigned comparison of src1 to src2. If src1 is less than src2, then 1 is
written to dst; otherwise, 0 is written to dst.
Execution if (cond){
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
Example 3
4.57 CMPLTU4
Compare for Less Than, Unsigned, Packed 8-Bit
Opcode
Bits    31-29  28  27-23  22-18  17-13  12  11-2                 1  0
Field   creg   z   dst    src2   src1   x   0 1 0 1 0 1 1 0 0 0  s  p
Width   3      1   5      5      5      1   10                   1  1
The 8-bit values in each input are numbered from 0 to 3, starting with the
least-significant byte, and moving towards the most-significant byte. The comparison
results for byte 0 are written to bit 0 of the result. Similarly, the results for byte 1 to 3
are written to bits 1 to 3 of the result, respectively, as shown in the diagram below. The
remaining bits of dst are cleared to 0.
The assembler uses the operation CMPGTU4 (.unit) src1, src2, dst to perform this task
(see CMPGTU4).
Execution if (cond){
if (ubyte0(src2) < ubyte0(src1)), 1 → dst0 else 0 → dst0;
if (ubyte1(src2) < ubyte1(src1)), 1 → dst1 else 0 → dst1;
if (ubyte2(src2) < ubyte2(src1)), 1 → dst2 else 0 → dst2;
if (ubyte3(src2) < ubyte3(src1)), 1 → dst3 else 0 → dst3
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
Example 2
Example 3
CMPLTU4 .S2 B8,B2,B13; assembler treats as CMPGTU4 B2,B8,B13
4.58 CMPY
Complex Multiply Two Pairs, Signed, Packed 16-Bit
Opcode
31 30 29 28 27 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 0 1 0 1 1 0 0 s p
5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1 and src2 are treated as signed, packed 16-bit quantities. The signed results are
written to a 64-bit register pair.
The product of the lower halfwords of src1 and src2 is subtracted from the product of
the upper halfwords of src1 and src2. The result is written to dst_o.
The product of the upper halfword of src1 and the lower halfword of src2 is added to
the product of the lower halfword of src1 and the upper halfword of src2. The result is
written to dst_e.
If the result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the result is written to dst_e.
Note—In the overflow case, where all four halfwords in src1 and src2 are 8000h,
the saturation value 7FFF FFFFh is written into the 32-bit dst_e register.
Execution sat((lsb16(src1) × msb16(src2)) + (msb16(src1) × lsb16(src2))) → dst_e
(msb16(src1) × msb16(src2)) - (lsb16(src1) × lsb16(src2)) → dst_o
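The two execution equations can be traced with a small Python model. This is a sketch under stated assumptions: the function name cmpy and the (dst_o, dst_e) return order are choices made here, the second test vector is invented for illustration, and the SSR/CSR saturation flags are not modeled.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def sat32(x):
    """Clamp x to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, x))

def cmpy(src1, src2):
    """Return (dst_o, dst_e) as unsigned 32-bit register images."""
    a_hi, a_lo = s16(src1 >> 16), s16(src1)
    b_hi, b_lo = s16(src2 >> 16), s16(src2)
    dst_e = sat32(a_lo * b_hi + a_hi * b_lo)   # sum of cross products, saturated
    dst_o = a_hi * b_hi - a_lo * b_lo          # difference always fits in 32 bits
    return dst_o & 0xFFFFFFFF, dst_e & 0xFFFFFFFF

overflow_case = cmpy(0x80008000, 0x80008000)   # all four halfwords are 8000h
small_case = cmpy(0x00020003, 0x00040005)      # hypothetical small operands
```

With all four halfwords equal to 8000h the model returns 7FFF FFFFh for dst_e, matching the overflow note above, and 0 for dst_o.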
Delay Slots 3
Examples Example 1
Example 2
Example 3
CMPY .M1 A0,A1,A3:A2
4.59 CMPY32R1
Complex Multiply With Rounding, Signed Complex 32-bit (32-bit Real/32-bit
Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description The CMPY32R1 instruction performs one complex multiply between two 64-bit
complex numbers. The 64-bit complex number is contained in a register pair. The Odd
register in the pair (the most significant word) represents the real component of the
complex number as a 32-bit signed quantity. The Even register in the pair represents
the imaginary component of the complex number.
After the 32-bit numbers are multiplied and added together, the sums are shifted right by 31,
rounded, and saturated to 32 bits. Intermediate results are calculated at 64 bits.
Delay Slots 3
Functional Unit Latency 1
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x00000000
A2 == 0x08000400
A1 == 0x00000000
A0 == 0x09000200
CMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0xff6fff98
A14 == 0x00000000
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x80000000
A2 == 0x80000000
A1 == 0x80000000
A0 == 0x80000000
CMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x7fffffff
CSR= 0x00000200
CSR == 0x00000000 ; 4
A3 == 0x7FFF7FFF
A2 == 0x7FFF8000
A1 == 0x7FFF8000
A0 == 0x7FFF7FFF
CMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x7fffffff
CSR= 0x00000200
CSR == 0x00000000 ; 4
A3 == 0xFFFFFFFF
A2 == 0xFFFFFFFF
A1 == 0xFFFFFFFF
A0 == 0xFFFFFFFF
CMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x00000000
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x55555555
A2 == 0x55555555
A1 == 0x55555555
A0 == 0x55555555
CMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x71c71c71
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x01234567
A2 == 0x89ABCDEF
A1 == 0x89ABCDEF
A0 == 0x01234567
CMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x6d660a7f
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x7fff7fff
A2 == 0x80000000
A1 == 0x7fff7fff
A0 == 0x7fffffff
CMPY32R1 .M A3:A2,A1:A0,A15:A14
A15 == 0x7fffffff
A14 == 0xffffffff
CSR= 0x00000200 ; 4
CSR == 0x00000000 ; 4
B3 == 0x80000000
B2 == 0x80000000
B1 == 0x80000000
B0 == 0x7fffffff
CMPY32R1 .M B3:B2,B1:B0,B15:B14
B15 == 0x7fffffff
B14 == 0x00000001
CSR= 0x00000200 ; 4
CSR == 0x00000000 ; 4
B3 == 0x80000000
B2 == 0x80000000
B1 == 0x7fffffff
B0 == 0x7fffffff
CMPY32R1 .M B3:B2,B1:B0,B15:B14
B15 == 0x00000000
B14 == 0x80000000
CSR= 0x00000200 ; 4
CSR == 0x00000000 ; 4
B3 == 0x80000000
B2 == 0x7fffffff
B1 == 0x7fffffff
B0 == 0x7fffffff
CMPY32R1 .M B3:B2,B1:B0,B15:B14
B15 == 0x80000000
B14 == 0xffffffff
CSR= 0x00000200 ; 4
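Several of the register listings above can be reproduced with the following Python sketch. It rests on assumptions that are not stated in the text: rounding is modeled as adding 2^30 before an arithmetic right shift by 31, Python's unbounded integers stand in for the 64-bit intermediates, and the CSR SAT bit is not modeled. The names cmpy32r1 and round_shift31 are hypothetical.

```python
def s32(x):
    """Interpret the low 32 bits of x as a signed word."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def sat32(x):
    """Clamp x to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, x))

def round_shift31(acc):
    """Assumed rounding: add 2**30, arithmetic shift right by 31, then saturate."""
    return sat32((acc + (1 << 30)) >> 31)

def cmpy32r1(src1, src2):
    """src1 and src2 are (odd, even) = (real, imaginary) register images."""
    re1, im1 = s32(src1[0]), s32(src1[1])
    re2, im2 = s32(src2[0]), s32(src2[1])
    real = round_shift31(re1 * re2 - im1 * im2)
    imag = round_shift31(re1 * im2 + im1 * re2)
    return real & 0xFFFFFFFF, imag & 0xFFFFFFFF   # (dst odd, dst even)

first = cmpy32r1((0x00000000, 0x08000400), (0x00000000, 0x09000200))
```

For the first listing (A3:A2 and A1:A0 as sources) the model returns FF6FFF98h in the odd register and 0 in the even register, the same A15:A14 values shown above.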
4.60 CMPYR
Complex Multiply Two Pairs, Signed, Packed 16-Bit With Rounding
Opcode
31 30 29 28 27 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 0 1 1 1 1 0 0 s p
5 1 1 1
Description Performs two dot-products between two pairs of signed, packed 16-bit values. The
values in src1 and src2 are treated as signed, packed 16-bit quantities. The signed results
are rounded with saturation, shifted, packed and written to a 32-bit register.
The product of the lower halfwords of src1 and src2 is subtracted from the product of
the upper halfwords of src1 and src2. The result is rounded by adding 2^15 to it. The 16
most-significant bits of the rounded value are written to the upper half of dst.
The product of the upper halfword of src1 and the lower halfword of src2 is added to
the product of the lower halfword of src1 and the upper halfword of src2. The result is
rounded by adding 2^15 to it. The 16 most-significant bits of the rounded value are
written to the lower half of dst.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written
one cycle after the result is written to dst.
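The rounding and packing steps can be sketched in Python as below. The function name cmpyr and the test operands are chosen here for illustration, saturation is assumed to be applied to the 32-bit rounded sum before the upper 16 bits are taken, and the SSR/CSR flags are not modeled.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def sat32(x):
    """Clamp x to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, x))

def cmpyr(src1, src2):
    """Return the packed 32-bit result: rounded real part in the upper half, imaginary in the lower."""
    a_hi, a_lo = s16(src1 >> 16), s16(src1)
    b_hi, b_lo = s16(src2 >> 16), s16(src2)
    real = sat32(a_hi * b_hi - a_lo * b_lo + (1 << 15))
    imag = sat32(a_lo * b_hi + a_hi * b_lo + (1 << 15))
    return (((real >> 16) & 0xFFFF) << 16) | ((imag >> 16) & 0xFFFF)

saturating = cmpyr(0x80008000, 0x80008000)  # lower-half sum saturates
half_squared = cmpyr(0x40000000, 0x40000000)
```

With 4000h in the upper halfword of both sources and 0 in the lower, the upper half of the result is 1000h: the product 2^28 plus the rounding term, shifted down by 16.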
Delay Slots 3
Examples Example 1
CMPYR .M1 A0,A1,A2
A1 0900 0200h
Example 2
A1 7FFF 8000h
Example 3
CMPYR .M1 A0,A1,A2
A1 8000 8000h
Example 4
B1 8000 8001h
4.61 CMPYR1
Complex Multiply Two Pairs, Signed, Packed 16-Bit With Rounding
Opcode
31 30 29 28 27 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 0 0 1 1 0 0 s p
5 1 1 1
Description Performs two dot-products between two pairs of signed, packed 16-bit values. The
values in src1 and src2 are treated as signed, packed 16-bit quantities. The signed results
are rounded with saturation to 31 bits, shifted, packed and written to a 32-bit register.
The product of the lower halfwords of src1 and src2 is subtracted from the product of
the upper halfwords of src1 and src2. The intermediate result is rounded by adding 2^14
to it. This value is shifted left by 1 with saturation. The 16 most-significant bits of the
shifted value are written to the upper half of dst.
The product of the upper halfword of src1 and the lower halfword of src2 is added to
the product of the lower halfword of src1 and the upper halfword of src2. The
intermediate result is rounded by adding 2^14 to it. This value is shifted left by 1 with
saturation. The 16 most-significant bits of the shifted value are written to the lower half
of dst.
If either result saturates in the rounding or shifting process, the M1 or M2 bit in SSR
and the SAT bit in CSR are written one cycle after the results are written to dst.
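A minimal Python sketch of the round, shift, and pack sequence follows. The function name cmpyr1 and the operands are illustrative choices, saturation is assumed to be applied once after the left shift, and the SSR/CSR flags are not modeled.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def sat32(x):
    """Clamp x to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, x))

def cmpyr1(src1, src2):
    """Round with 2**14, shift left by 1 with saturation, keep the upper 16 bits of each part."""
    a_hi, a_lo = s16(src1 >> 16), s16(src1)
    b_hi, b_lo = s16(src2 >> 16), s16(src2)
    real = sat32((a_hi * b_hi - a_lo * b_lo + (1 << 14)) << 1)
    imag = sat32((a_lo * b_hi + a_hi * b_lo + (1 << 14)) << 1)
    return (((real >> 16) & 0xFFFF) << 16) | ((imag >> 16) & 0xFFFF)

half_squared = cmpyr1(0x40000000, 0x40000000)
saturating = cmpyr1(0x80008000, 0x80008000)
```

The extra left shift means that 4000h times 4000h yields 2000h in the upper half, which is the Q15 product of 0.5 and 0.5.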
Delay Slots 3
Examples Example 1
CMPYR1 .M1 A0,A1,A2
A1 0900 0200h
Example 2
A1 7FFF 8000h
Example 3
A1 8000 8000h
Example 4
B1 8000 8001h
4.62 CMPYSP
Single Precision Complex Floating Point Multiply
31 30 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
5 5 5 5
Description This instruction performs a complex multiply of two Single Precision Floating-Point
numbers in a register pair giving a 128-bit output.
• The product of the lower word of src1 and the upper word of src2 is placed into
dst_0.
• The product of the lower word of src1 and the lower word of src2 is negated and
placed into dst_1.
• The product of the upper word of src1 and the lower word of src2 is placed into
dst_2.
• The product of the upper word of src1 and the upper word of src2 is placed into
dst_3.
Special Cases:
• If one source is SNaN or QNaN, the result is a signed NaN_out and the NANn bit
is set. If either source is SNaN, the INVAL bit is also set. The sign of NaN_out is
the XOR of the input signs.
• Signed infinity multiplied by signed infinity or a normalized number (other than
signed zero) returns signed infinity. Signed infinity multiplied by signed zero (or
a denormal) returns a signed NaN_out and sets the INVAL bit.
• If one or both sources are signed zero, the result is signed zero unless the other
source is a NaN or signed infinity, in which case the result is signed NaN_out.
• If signed zero is multiplied by signed infinity, the result is signed NaN_out and
the INVAL bit is set.
• A denormalized source is treated as signed zero and the DENn bit is set. The
INEX bit is set except when the other source is signed infinity, signed NaN, or
signed zero. Therefore, a signed infinity multiplied by a denormalized number
gives a signed NaN_out and sets the INVAL bit.
• If rounding is performed, the INEX bit is set.
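The placement of the four products can be illustrated with a Python sketch that works on 32-bit register images. It covers ordinary finite values only: the NaN, infinity, and denormal cases in the list above and the FMCR status bits are deliberately left out, and the names f32, bits32, and cmpysp are hypothetical.

```python
import struct

def f32(bits):
    """Reinterpret a 32-bit register image as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

def bits32(value):
    """Round a Python float to single precision and return its 32-bit image."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def cmpysp(src1, src2):
    """src1 and src2 are (upper, lower) word images; returns (dst_3, dst_2, dst_1, dst_0)."""
    a_hi, a_lo = f32(src1[0]), f32(src1[1])
    b_hi, b_lo = f32(src2[0]), f32(src2[1])
    dst_0 = bits32(a_lo * b_hi)
    dst_1 = bits32(-(a_lo * b_lo))   # lower x lower, negated
    dst_2 = bits32(a_hi * b_lo)
    dst_3 = bits32(a_hi * b_hi)
    return dst_3, dst_2, dst_1, dst_0

# Source values of -2.5, 3.5 and 3.25, 52.25 as single-precision images.
quad = cmpysp((0xC0200000, 0x40600000), (0x40500000, 0x42510000))
```

The result is returned as four separate products; a full complex product would still need the two pairs to be combined by a following add.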
Delay Slots 3
FMCR= 0x00000080
FMCR== 0x00000000
A9 == 0xc0200000
A8 == 0x40600000
A7 == 0x40500000
A6 == 0x42510000
CMPYSP .M A9:A8,A7:A6,A3:A2:A1:A0
A3 == 0xc1020000
A2 == 0xc302a000
A1 == 0xc336e000
A0 == 0x41360000
FMCR= 0x00000000
FMCR== 0x00000000
A9 == 0x7fc00000
A8 == 0x42510000
A7 == 0x40600000
A6 == 0xc0200000
CMPYSP .M A9:A8,A7:A6,A3:A2:A1:A0
A3 == 0x7fffffff
A2 == 0xffffffff
A1 == 0x4302a000
A0 == 0x4336e000
FMCR= 0x00000001
FMCR== 0x00000000
A9 == 0xffc00000
A8 == 0x42510000
A7 == 0x40600000
A6 == 0xc0200000
CMPYSP .M A9:A8,A7:A6,A3:A2:A1:A0
A3 == 0xffffffff
A2 == 0x7fffffff
A1 == 0x4302a000
A0 == 0x4336e000
FMCR= 0x00000001
FMCR== 0x00000000
B9 == 0x7fc00000
B8 == 0x42510000
B3 == 0x7f900000
B2 == 0xc0200000
CMPYSP .M B9:B8,B3:B2,B7:B6:B5:B4
B7 == 0x7fffffff
B6 == 0xffffffff
B5 == 0x4302a000
B4 == 0x7fffffff
FMCR= 0x00130000
FMCR== 0x00000000
B9 == 0xff900033
B8 == 0x7f800000
B3 == 0x7fc00000
B2 == 0xc0200000
CMPYSP .M B9:B8,B3:B2,B7:B6:B5:B4
B7 == 0xffffffff
B6 == 0x7fffffff
B5 == 0x7f800000
B4 == 0x7fffffff
FMCR= 0x00330000
FMCR== 0x00000000
B9 == 0x7f800000
B8 == 0x00802000
B3 == 0x7fc00000
B2 == 0xc0200000
CMPYSP .M B9:B8,B3:B2,B7:B6:B5:B4
B7 == 0x7fffffff
B6 == 0xff800000
B5 == 0x01202800
B4 == 0x7fffffff
FMCR= 0x00220000
FMCR== 0x00000000
B9 == 0x7f800000
B8 == 0x00802000
B3 == 0x00000000
B2 == 0xc0200000
CMPYSP .M B9:B8,B3:B2,B7:B6:B5:B4
B7 == 0x7fffffff
B6 == 0xff800000
B5 == 0x01202800
B4 == 0x00000000
FMCR= 0x00300000
FMCR== 0x00000000
B9 == 0x00000000
B8 == 0x00802000
B3 == 0x7f800000
B2 == 0x40600000
CMPYSP .M B9:B8,B3:B2,B7:B6:B5:B4
B7 == 0x7fffffff
B6 == 0x00000000
B5 == 0x81603800
B4 == 0x7f800000
FMCR= 0x00300000
4.63 CROT270
Complex Rotate By 270 Degrees, Signed Complex 16-bit (16-bit Real/16-bit Imaginary)
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 5 5 4 3 2 1 0
3 5 5 5
Description Returns the 270 degree rotation of the input complex number. This is the same as
multiplying the number by -j.
The input format is two packed signed 16-bit numbers: bits 31 through 16 are the real
portion of the number, and bits 15 through 0 are the imaginary part.
The imaginary input is returned as the real output, and the real portion is
negated with saturation and returned as the imaginary portion.
Execution if(cond) {
lsb16(src1) -> msb16(dst)
sat(-msb16(src1)) -> lsb16(dst)
}
else nop
Delay Slots 0
Example A0 == 0x12345678
CROT270 .L A0,A15
A15 == 0x5678EDCC
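The execution steps and the example above can be checked with a short Python model. The function name crot270 is hypothetical and the conditional (cond) part of the instruction is not modeled.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def sat16(x):
    """Clamp x to the signed 16-bit range."""
    return max(-0x8000, min(0x7FFF, x))

def crot270(src1):
    """lsb16(src1) goes to the upper half; the saturated negation of msb16(src1) goes to the lower half."""
    upper = s16(src1)
    lower = sat16(-s16(src1 >> 16))
    return ((upper & 0xFFFF) << 16) | (lower & 0xFFFF)

rotated = crot270(0x12345678)
```

Saturation only matters when the upper halfword is 8000h, whose negation does not fit in 16 bits and is clamped to 7FFFh.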
4.64 CROT90
Complex Rotate By 90 Degrees, Signed Complex 16-bit (16-bit Real/16-bit Imaginary)
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 5 5 4 3 2 1 0
3 5 5 5
Description Returns the 90 degree rotation of the input complex number. This is the same as
multiplying the number by j.
The input format is 2 packed signed 16-bit numbers, bits 31 through 16 are the real
portion of the number, and bits 15 through 0 are the imaginary part.
The real input is returned as the imaginary output, and the imaginary portion is
negated with saturation and returned as the real portion.
Execution if(cond) {
sat(-lsb16(src1)) -> msb16(dst)
msb16(src1) -> lsb16(dst)
}
else nop
Delay Slots 0
Functional Unit Latency 1
Example A0 == 0x12345678
CROT90 .L A0,A15
A15 == 0xA9881234
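A Python model of the two execution steps reproduces the example above. The function name crot90 is hypothetical and the conditional (cond) part of the instruction is not modeled.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def sat16(x):
    """Clamp x to the signed 16-bit range."""
    return max(-0x8000, min(0x7FFF, x))

def crot90(src1):
    """The saturated negation of lsb16(src1) goes to the upper half; msb16(src1) goes to the lower half."""
    upper = sat16(-s16(src1))
    lower = s16(src1 >> 16)
    return ((upper & 0xFFFF) << 16) | (lower & 0xFFFF)

rotated = crot90(0x12345678)
```

As a sanity check on the model, applying it four times to an input that contains no 8000h halfword returns the original value.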
4.65 DADD
2-Way SIMD Addition, Packed Signed 32-bit
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
3 5 5 5 6
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
5 5 5 5
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
5 5 5 7
Description The DADD instruction performs two 32-bit additions of the packed 32-bit numbers
contained in the two source register pairs. The addition results are returned as two
32-bit results packed into dwdst.
63 32 31 0
high1 low1 ←dwop1
+ +
DADD
v v
Execution if(cond) {
src1_e + src2_e -> dst_e
src1_o + src2_o -> dst_o
}
else nop
Delay Slots 0
Example A1 == 0x00000010
A0 == 0x00050011
DADD .L -16,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x00050001
A1 == 0x44444444
A0 == 0x7fffffff
A3 == 0xccccc444
A2 == 0x00000001
DADD .L A1:A0,A3:A2,A15:A14
A15 == 0x11110888
A14 == 0x80000000
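Both listings above can be reproduced with a small Python model. The function names are hypothetical, and the constant form assumes, as the first listing suggests, that the sign-extended constant is added to both 32-bit words. Results wrap modulo 2^32, since the second listing shows 7FFFFFFFh + 1 giving 80000000h.

```python
MASK32 = 0xFFFFFFFF

def dadd(src1, src2):
    """Add two (odd, even) register pairs word by word, wrapping to 32 bits."""
    return tuple((a + b) & MASK32 for a, b in zip(src1, src2))

def dadd_const(const, src2):
    """Assumed constant form: the signed constant is added to both words of src2."""
    return tuple((const + w) & MASK32 for w in src2)

pair_sum = dadd((0x44444444, 0x7FFFFFFF), (0xCCCCC444, 0x00000001))
const_sum = dadd_const(-16, (0x00000010, 0x00050011))
```

No saturation is applied in this model, which is why the even word crosses from the largest positive value to 80000000h.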
4.66 DADD2
4-Way SIMD Addition, Packed Signed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
5 5 5 6
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
5 5 5 7
Description The DADD2 instruction performs four 16-bit additions of the packed 16-bit numbers
contained in the two 64-bit wide source registers. The addition results are returned as
four 16-bit results packed into dst.
63 48 47 32 31 16 15 0
A B C D ←dwop1
W X Y Z ←xdwop2
v v v v
Delay Slots 0
Example A1 == 0x44444444
A0 == 0x7fff0002
A3 == 0xccccc444
A2 == 0x7ff00005
DADD2 .L A1:A0,A3:A2,A15:A14
A15 == 0x11100888
A14 == 0xffef0007
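The example above can be reproduced with the following Python sketch, in which each 16-bit lane is added independently and carries do not cross lane boundaries. The function names are hypothetical, and wrap-around (no saturation) is inferred from the listed result rather than stated in the text.

```python
def add2_word(a, b):
    """Add the two 16-bit lanes of a and b independently, wrapping each to 16 bits."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (hi << 16) | lo

def dadd2(src1, src2):
    """Apply the lane-wise add to both words of the (odd, even) register pairs."""
    return tuple(add2_word(a, b) for a, b in zip(src1, src2))

lanes = dadd2((0x44444444, 0x7FFF0002), (0xCCCCC444, 0x7FF00005))
```

In the even word the upper lanes 7FFFh and 7FF0h sum to FFEFh, showing that the result wraps rather than saturates.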
4.67 DADDSP
2-Way SIMD Single Precision Floating Point Addition
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
5 5 5 7
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
5 5 5 6
Description Performs a SIMD single-precision floating point ADD on the pairs of numbers in
dwop1 and xdwop2.
The values in the even registers of dwop1 and xdwop2 are added. The result is placed in
the even register in the destination register pair dwdst. The values in the odd registers of
dwop1 and xdwop2 are added. The result is placed in the odd register in the
destination register pair dwdst.
Delay Slots 2
Example A3 == 0x41200000
A2 == 0xC1200000
A5 == 0x42300000
A4 == 0x42300000
DADDSP .L A3:A2,A5:A4,A1:A0
A1 == 0x42580000
A0 == 0x42080000
;[10]+[44]->[54];[-10]+[44]->[34];
A3 == 0x41200000
A2 == 0xC1200000
A5 == 0xC2300000
A4 == 0xC2300000
DADDSP .L A3:A2,A5:A4,A1:A0
A1 == 0xC2080000
A0 == 0xC2580000
;[10]+[-44]->[-34];[-10]+[-44]->[-54];
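The two listings above can be checked with a Python sketch that adds the register images as single-precision values. Only ordinary finite values are handled, status bits are not modeled, and the names f32, bits32, and daddsp are hypothetical.

```python
import struct

def f32(bits):
    """Reinterpret a 32-bit register image as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

def bits32(value):
    """Round a Python float to single precision and return its 32-bit image."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def daddsp(src1, src2):
    """Add (odd, even) pairs: odd + odd goes to the odd word, even + even to the even word."""
    return tuple(bits32(f32(a) + f32(b)) for a, b in zip(src1, src2))

first = daddsp((0x41200000, 0xC1200000), (0x42300000, 0x42300000))
second = daddsp((0x41200000, 0xC1200000), (0xC2300000, 0xC2300000))
```

The first call corresponds to 10 + 44 and -10 + 44, the second to 10 + (-44) and -10 + (-44).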
4.68 DAPYS2
4-Way SIMD Apply Sign Bits to Operand
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
5 5 5 7
Description The DAPYS2 instruction uses the sign-bits of the 16-bit packed quantities from dwop1
to conditionally negate the packed 16-bit quantities from xdwop2. If the sign-bit is set,
then the corresponding field in xdwop2 will be negated, otherwise it will pass through
unchanged. The boundary case of 0x8000 will saturate to 0x7FFF.
The DAPYS2 instruction can be used to return the absolute value of the packed 16-bit
quantities by sending the same data on to both operands. For example:
After execution of this instruction, A9:A8 will contain the four absolute values of the
contents of A1:A0
Execution if(sn(lsb16(src1_e)))
    sat(-lsb16(src2_e)) -> lsb16(dst_e)
else
    lsb16(src2_e) -> lsb16(dst_e)
if(sn(msb16(src1_e)))
    sat(-msb16(src2_e)) -> msb16(dst_e)
else
    msb16(src2_e) -> msb16(dst_e)
if(sn(lsb16(src1_o)))
    sat(-lsb16(src2_o)) -> lsb16(dst_o)
else
    lsb16(src2_o) -> lsb16(dst_o)
if(sn(msb16(src1_o)))
    sat(-msb16(src2_o)) -> msb16(dst_o)
else
    msb16(src2_o) -> msb16(dst_o)
Delay Slots 0
Example A1 == 0x44444444
A0 == 0x7fffffff
A3 == 0xccccc444
A2 == 0x00000001
DAPYS2 .L A1:A0,A3:A2,A15:A14
A15 == 0xccccc444
A14 == 0x0000ffff
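The example above, and the absolute-value use described earlier, can be modeled in Python as follows. The function names are hypothetical and the second set of operands is invented here to exercise the 8000h boundary case.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def apply_sign(sign_src, value):
    """Negate value (with saturation) when the sign bit of sign_src is set."""
    v = s16(value)
    if sign_src & 0x8000:
        v = max(-0x8000, min(0x7FFF, -v))
    return v & 0xFFFF

def dapys2(src1, src2):
    """src1 supplies the sign bits, src2 the values; both are (odd, even) pairs."""
    out = []
    for w1, w2 in zip(src1, src2):
        hi = apply_sign(w1 >> 16, w2 >> 16)
        lo = apply_sign(w1, w2)
        out.append((hi << 16) | lo)
    return tuple(out)

listed = dapys2((0x44444444, 0x7FFFFFFF), (0xCCCCC444, 0x00000001))
data = (0x8000FFFF, 0x00017FFF)          # hypothetical input for the absolute-value idiom
absolute = dapys2(data, data)            # same data on both operands
```

Passing the same pair as both operands yields the lane-wise absolute values, with 8000h clamped to 7FFFh.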
4.69 DAVG2
4-Way SIMD Average, Signed, Packed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description The DAVG2 instruction performs an averaging operation on packed 16-bit data. For
each pair of signed 16-bit values provided in dwop1 and xdwop2, DAVG2 calculates the
average of the two values and writes it as a signed 16-bit quantity to the corresponding
position of dwdst.
The averaging and rounding operation itself is performed by adding 1 to the sum of the
two 16 bit numbers. The result is then right-shifted by 1 to produce the final 16-bit
result.
63 48 47 32 31 16 15 0
i3 i2 i1 i0 ←op1
j3 j2 j1 j0 ←xop2
v v v v
Delay Slots 1
Example A3 == 0xc001c001
A2 == 0x40004000
A1 == 0x3fff3fff
A0 == 0x40004000
DAVG2 .M A3:A2,A1:A0,A7:A6
A7 == 0x00000000
A6 == 0x40004000
A3 == 0x3fff3fff
A2 == 0xc000c000
A1 == 0x3fff3fff
A0 == 0xc000c000
DAVG2 .M A3:A2,A1:A0,A7:A6
A7 == 0x3fff3fff
A6 == 0xc000c000
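Both listings above follow from the add-one-then-shift rule, as this Python sketch shows. The function names are hypothetical, and the shift is modeled as an arithmetic (sign-preserving) shift of the full-precision sum.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def avg2_word(a, b):
    """Rounded average of the two signed 16-bit lanes: (x + y + 1) >> 1 per lane."""
    lo = ((s16(a) + s16(b) + 1) >> 1) & 0xFFFF
    hi = ((s16(a >> 16) + s16(b >> 16) + 1) >> 1) & 0xFFFF
    return (hi << 16) | lo

def davg2(src1, src2):
    """Apply the lane-wise rounded average to both words of the (odd, even) pairs."""
    return tuple(avg2_word(a, b) for a, b in zip(src1, src2))

first = davg2((0xC001C001, 0x40004000), (0x3FFF3FFF, 0x40004000))
second = davg2((0x3FFF3FFF, 0xC000C000), (0x3FFF3FFF, 0xC000C000))
```

Because the sum is formed at full precision before the shift, averaging two values near the 16-bit limits does not overflow in this model.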
4.70 DAVGNR2
4-Way SIMD Average Without Rounding, Signed Packed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description The DAVGNR2 instruction performs an averaging operation on packed 16-bit data.
For each pair of signed 16-bit values provided in dwop1 and xdwop2, DAVGNR2
calculates the average of the two values and writes it as a signed 16-bit quantity to the
corresponding position of dwdst.
The averaging operation itself is performed without rounding: first we compute the
sum of the two 16-bit numbers being averaged. The result is then right-shifted by 1 and
sign extended to produce a 16-bit result. The intermediate results are kept at full
precision internally, so that no overflow conditions exist.
63 48 47 32 31 16 15 0
i3 i2 i1 i0 ←op1
j3 j2 j1 j0 ←xop2
v v v v
Delay Slots 1
Example A3 == 0xc001c001
A2 == 0x40004000
A1 == 0x3fff3fff
A0 == 0x40004000
DAVGNR2 .M A3:A2,A1:A0,A7:A6
A7 == 0x00000000
A6 == 0x40004000
A3 == 0x3fff3fff
A2 == 0xc000c000
A1 == 0x3fff3fff
A0 == 0xc000c000
DAVGNR2 .M A3:A2,A1:A0,A7:A6
A7 == 0x3fff3fff
A6 == 0xc000c000
4.71 DAVGNRU4
8-Way SIMD Average Without Rounding, Unsigned Packed 8-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
For each pair of 8-bit quantities in dwop1 and xdwop2, the average of the unsigned 8-bit
value from dwop1 and the unsigned 8-bit value from xdwop2 is calculated to produce
an unsigned 8-bit result. The result is placed in the corresponding position in dwdst.
The averaging operation is performed without rounding -- the two numbers are merely
added together and the result shifted right by 1.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
i7 i6 i5 i4 i3 i2 i1 i0 ←dwop1
j7 j6 j5 j4 j3 j2 j1 j0 ←xdwop2
v v v v
(i7+j7)>>1 (i6+j6)>>1 (i5+j5)>>1 (i4+j4)>>1 (i3+j3)>>1 (i2+j2)>>1 (i1+j1)>>1 (i0+j0)>>1 ←dwdst
Delay Slots 1
Example A3 == 0x1A2E5F4E
A2 == 0xFBFCFDFE
A1 == 0x9EF26E3F
A0 == 0x03020201
DAVGNRU4 .M A3:A2,A1:A0,A15:A14
A15 == 0x5C906646
A14 == 0x7f7f7f7f
4.72 DAVGU4
8-Way SIMD Average, Unsigned Packed 8-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description The DAVGU4 instruction performs an averaging operation with rounding on unsigned
packed 8-bit quantities. The values in dwop1 and xdwop2 are treated as unsigned
packed 8-bit quantities, and the results are written in an unsigned packed 8-bit format.
For each pair of 8-bit quantities in dwop1 and xdwop2, the average of the unsigned 8-bit
value from dwop1 and the unsigned 8-bit value from xdwop2 is calculated to produce
an unsigned 8-bit result. The result is placed in the corresponding position in dwdst.
The averaging and rounding operation itself is performed by adding 1 to the sum of the
two unsigned 8-bit numbers being averaged. The result is then right-shifted by 1 to
produce the final unsigned 8-bit result. The intermediate results are kept at full
precision internally, so that no overflow conditions exist.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
i7 i6 i5 i4 i3 i2 i1 i0 ←dwop1
j7 j6 j5 j4 j3 j2 j1 j0 ←xdwop2
v v v v
Delay Slots 1
Functional Unit Latency 1
Example A3 == 0x1A2E5F4E
A2 == 0xFBFCFDFE
A1 == 0x9EF26E3F
A0 == 0x03020201
DAVGU4 .M A3:A2,A1:A0,A15:A14
A15 == 0x5C906747
A14 == 0x7f7f8080
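The unsigned 8-bit averaging can be modeled the same way. The sketch below is a host-side illustration, not TI code; `davgu4_model` and its `rounding` flag are assumed names. With `rounding=False` it follows the add-and-shift behavior described for DAVGNRU4 in section 4.71.

```python
def davgu4_model(op1, op2, rounding=True):
    """Average the eight unsigned 8-bit lanes of two 64-bit operands.

    rounding=True models DAVGU4 (add 1 before the shift);
    rounding=False models DAVGNRU4.
    """
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (op1 >> shift) & 0xFF
        b = (op2 >> shift) & 0xFF
        # The 9-bit intermediate sum is kept in full before shifting
        avg = (a + b + (1 if rounding else 0)) >> 1
        result |= avg << shift
    return result
```

For the example operands, `davgu4_model(0x1A2E5F4EFBFCFDFE, 0x9EF26E3F03020201)` returns `0x5C9067477F7F8080`, which matches A15:A14 above.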
4.73 DCCMPY
2-Way SIMD Complex Multiply With Conjugate, Packed Complex Signed 16-bit
(16-bit Real/16-bit Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description This instruction performs two complex multiplies on two pairs of packed complex
numbers (16-bit real, 16-bit imaginary). It first takes the complex conjugate of the two
complex numbers in the second operand, then multiplies the two complex numbers in the
first input register pair by the corresponding two complex numbers in the second input
register pair. Each result is a 32-bit complex number (32-bit real, 32-bit imaginary);
the results are placed in the destination registers.
Delay Slots 3
Example A5 == 0x0009FFFE
A4 == 0x0009FFFE
A7 == 0xFFFF0007
A6 == 0xFFFF0007
DCCMPY .M A5:A4,A7:A6,A15:A14:A13:A12
A15 == 0xFFFFFFE9
A14 == 0xFFFFFFC3
A13 == 0xFFFFFFE9
A12 == 0xFFFFFFC3
4.74 DCCMPYR1
2-Way SIMD Complex Multiply With Conjugate and Rounding, Packed Complex
16-bit (16-bit Real/16-bit Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description This instruction performs two complex multiplies on two pairs of packed 16-bit
complex numbers. It first takes the complex conjugate of the two 16-bit complex numbers
in the second operand, then multiplies the two complex numbers in the first input
register pair by the corresponding two complex numbers in the second input register
pair. Each result is a 16-bit complex number; the results are placed in the destination
register pair.
Delay Slots 3
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x80008000
A2 == 0x80008000
A1 == 0x80007fff
A0 == 0x80007fff
DCCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00017fff
A14 == 0x00017fff
CSR= 0x00000200
CSR == 0x00000000 ; 4
A3 == 0x80008000
A2 == 0x80008000
A1 == 0x80008000
A0 == 0x80008000
DCCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x7fff0000
A14 == 0x7fff0000
CSR= 0x00000200
CSR == 0x00000000 ; 4
A3 == 0x08000400
A2 == 0x09000200
A1 == 0x09000200
A0 == 0x08000400
DCCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00a00028
A14 == 0x00a0ffd8
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x7FFF7FFF
A2 == 0x7FFF8000
A1 == 0x7FFF8000
A0 == 0x7FFF7FFF
DCCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0xffff7fff
A14 == 0xffff8000
CSR= 0x00000200
CSR == 0x00000000 ; 4
A3 == 0xFFFFFFFF
A2 == 0xFFFFFFFF
A1 == 0xFFFFFFFF
A0 == 0xFFFFFFFF
DCCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x00000000
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x55555555
A2 == 0x55555555
A1 == 0x55555555
A0 == 0x55555555
DCCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x71c60000
A14 == 0x71c60000
CSR= 0x00000000
CSR == 0x00000000 ; 4
A3 == 0x01234567
A2 == 0x89ABCDEF
A1 == 0x89ABCDEF
A0 == 0x01234567
DCCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0xe3cec049
A14 == 0xe3ce3fb7
CSR= 0x00000000
CSR == 0x00000000 ; 4
B3 == 0x80008000
B2 == 0x80008000
B1 == 0x80007fff
B0 == 0x80007fff
DCCMPYR1 .M B3:B2,B1:B0,B15:B14 ; added for a & b_path cov.
B15 == 0x00017fff
B14 == 0x00017fff
CSR= 0x00000200
CSR == 0x00000000 ; 4
B3 == 0x80008000
B2 == 0x80008000
B1 == 0x80008000
B0 == 0x80008000
DCCMPYR1 .M B3:B2,B1:B0,B15:B14
B15 == 0x7fff0000
B14 == 0x7fff0000
CSR= 0x00000200
4.75 DCMPEQ2
2-Way SIMD Compare If Equal, Packed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
5 5 5 6
Description The DCMPEQ2 instruction performs equality comparisons on packed 16-bit data.
Each 16-bit value in dwop1 is compared against the corresponding 16-bit value in xdwop2,
returning a 1 if equal or 0 if not equal. The equality results are packed into the four
least-significant bits of dst.
63 48 47 32 31 16 15 0
a b c d ←dwop1
w x y z ←xdwop2
d == z
c == y
b == x
a == w
0 0 0 0 0 0 0 = = = = ←dst
31 3 2 1 0
Delay Slots 0
Example A1 == 0x00000000
A0 == 0x3fff0000
A3 == 0x00000000
A2 == 0x00003fff
DCMPEQ2 .S A1:A0,A3:A2,A15
A15 == 0x0000000c
A1 == 0x3fffc000
A0 == 0x0000c001
A3 == 0x3fffc000
A2 == 0x3fffc001
DCMPEQ2 .S A1:A0,A3:A2,A15
A15 == 0x0000000d
A1 == 0xc0010000
A0 == 0x44444444
A3 == 0xc0010110
A2 == 0xcccc4444
DCMPEQ2 .S A1:A0,A3:A2,A15
A15 == 0x00000009
4.76 DCMPEQ4
4-Way SIMD Compare If Equal, Packed 8-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 2 1 0
5 5 5 6 4
Description The DCMPEQ4 instruction performs equality comparisons on packed 8-bit data. Each
8-bit value in dwop1 is compared against the corresponding 8-bit value in xdwop2,
returning a 1 if equal or 0 if not equal. The equality results are packed into the eight
least-significant bits of dst.
The 8-bit values in each input are numbered from 0..7 starting with the least-significant
byte, working towards the most-significant byte. The comparison results for byte 0 are
written to bit 0 of the result. Likewise, the results for byte 1..7 are written to bits 1..7 of
the result, respectively, as shown in the diagram below.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
a b c d e f g h ← op1
A B C D E F G H ← xop2
h == H
g == G
f == F
e == E
d == D
c == C
b == B
a == A
0 0 0 0 0 0 0 = = = = = = = = ← dst
31 8 7 6 5 4 3 2 1 0
Delay Slots 0
Example A1 == 0xabffffff
A0 == 0xff00ff00
A3 == 0x00ffff00
A2 == 0x00ffff00
DCMPEQ4 .S A1:A0,A3:A2,A15
A15 == 0x00000063
A1 == 0xff0000ff
A0 == 0x023a4e1f
A3 == 0x002e3aff
A2 == 0x023b4e1f
DCMPEQ4 .S A1:A0,A3:A2,A15
A15 == 0x0000001b
A1 == 0x44444444
A0 == 0xff918aee
A3 == 0xccccc444
A2 == 0x01665a1e
DCMPEQ4 .S A1:A0,A3:A2,A15
A15 == 0x00000010
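The lane-to-bit packing described above can be sketched in Python. This is an illustrative host-side model, not TI code; `dcmpeq_model` and its `lane_bits` parameter are assumed names. With `lane_bits=8` it follows DCMPEQ4, and with `lane_bits=16` it follows the DCMPEQ2 behavior of section 4.75. Register pairs are passed as single 64-bit integers.

```python
def dcmpeq_model(op1, op2, lane_bits=8):
    """Compare corresponding lanes of two 64-bit operands for equality.

    Lane 0 is the least-significant lane; its result goes to bit 0 of dst.
    """
    mask = (1 << lane_bits) - 1
    dst = 0
    for lane in range(64 // lane_bits):
        shift = lane * lane_bits
        if ((op1 >> shift) & mask) == ((op2 >> shift) & mask):
            dst |= 1 << lane
    return dst
```

For the first example above, `dcmpeq_model(0xabffffffff00ff00, 0x00ffff0000ffff00)` returns `0x63`.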
4.77 DCMPGT2
2-Way SIMD Compare If Greater-Than, Packed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 2 1 0
5 5 5 6 4
Description The DCMPGT2 instruction performs greater-than comparisons on packed 16-bit data.
Each signed 16-bit value in dwop1 is compared against the corresponding signed 16-bit
value in xdwop2, returning a 1 if the value from dwop1 is greater than the value from
xdwop2, or 0 otherwise. The comparison results are packed into the four
least-significant bits of dst.
31 16 15 0 31 16 15 0
a b c d ←dwop1
w x y z ←xdwop2
d>z
c>y
b>x
a>w
0 0 0 0 0 0 0 = = = = ←dst
31 3 2 1 0
Delay Slots 0
Example A1 == 0x7fff7fff
A0 == 0x7fff7fff
A3 == 0x7fff7fff
A2 == 0x7fff7fff
DCMPGT2 .S A1:A0,A3:A2,A11
A11 == 0x00000000
A1 == 0x7fff7fff
A0 == 0x7fff7fff
A3 == 0x80008000
A2 == 0x80008000
DCMPGT2 .S A1:A0,A3:A2,A11
A11 == 0x0000000f
A1 == 0x80007fff
A0 == 0x80007fff
A3 == 0x7fff8000
A2 == 0x7fff8000
DCMPGT2 .S A1:A0,A3:A2,A11
A11 == 0x00000005
A1 == 0x7fff8000
A0 == 0x7fff8000
A3 == 0x80007fff
A2 == 0x80007fff
DCMPGT2 .S A1:A0,A3:A2,A11
A11 == 0x0000000a
A1 == 0x80008000
A0 == 0x80008000
A3 == 0x7fff7fff
A2 == 0x7fff7fff
DCMPGT2 .S A1:A0,A3:A2,A11
A11 == 0x00000000
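The greater-than comparison differs from the equality case only in how each lane is interpreted. The following Python sketch is an illustration, not TI code; `dcmpgt_model`, `lane_bits`, and `signed` are assumed names. The defaults follow DCMPGT2 (signed 16-bit lanes); `lane_bits=8, signed=False` follows the unsigned byte comparison of DCMPGTU4 in section 4.78.

```python
def dcmpgt_model(op1, op2, lane_bits=16, signed=True):
    """Set bit n of the result when lane n of op1 is greater than lane n of op2."""
    mask = (1 << lane_bits) - 1
    sign_bit = 1 << (lane_bits - 1)

    def lane_value(operand, lane):
        value = (operand >> (lane * lane_bits)) & mask
        if signed and value & sign_bit:
            value -= 1 << lane_bits  # two's-complement sign extension
        return value

    dst = 0
    for lane in range(64 // lane_bits):
        if lane_value(op1, lane) > lane_value(op2, lane):
            dst |= 1 << lane
    return dst
```

For the third example above, `dcmpgt_model(0x80007fff80007fff, 0x7fff80007fff8000)` returns `0x5`, because 0x8000 is the most negative signed 16-bit value.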
4.78 DCMPGTU4
4-Way SIMD Compare If Greater-Than, Unsigned Packed 8-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 2 1 0
5 5 5 6 4
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
a b c d e f g h ← op1
A B C D E F G H ← xop2
h>H
g>G
f>F
e>E
d>D
c>C
b>B
a>A
0 0 0 0 0 0 0 = = = = = = = = ← dst
31 8 7 6 5 4 3 2 1 0
Delay Slots 0
Example A1 == 0x000000ff
A0 == 0x000000ff
A3 == 0x000000fe
A2 == 0x000000fe
DCMPGTU4 .S A1:A0,A3:A2,A11
A11 == 0x00000011
A1 == 0x0000ffff
A0 == 0x0000ffff
A3 == 0x0000feff
A2 == 0x0000feff
DCMPGTU4 .S A1:A0,A3:A2,A11
A11 == 0x00000022
A1 == 0xffffffff
A0 == 0xffffffff
A3 == 0xfefffefe
A2 == 0xfefffefe
DCMPGTU4 .S A1:A0,A3:A2,A11
A11 == 0x000000bb
A1 == 0xffffffff
A0 == 0xffffffff
A3 == 0xfefefefe
A2 == 0xfefefefe
DCMPGTU4 .S A1:A0,A3:A2,A11
A11 == 0x000000ff
A1 == 0xabcdefac
A0 == 0xabcdefac
A3 == 0xbcdfaceb
A2 == 0xbcdfaceb
DCMPGTU4 .S A1:A0,A3:A2,A11
A11 == 0x00000022
4.79 DCMPY
2-Way SIMD Complex Multiply, Packed Complex 16-bit (16-bit Real/16-bit
Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 2 1 0
5 5 5 5 4
Description This instruction performs two complex multiplies on two pairs of packed complex
numbers.
The DCMPY instruction is functionally equivalent to the following instruction sequence:
CMPY src1_h, src2_h, dst_3:dst_2
|| CMPY src1_l, src2_l, dst_1:dst_0
Delay Slots 3
Example A5 == 0x0009FFFE
A4 == 0x0009FFFE
A7 == 0xFFFF0007
A6 == 0xFFFF0007
DCMPY .M A5:A4,A7:A6,A15:A14:A13:A12
A15 == 0x00000005
A14 == 0x00000041
A13 == 0x00000005
A12 == 0x00000041
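The arithmetic behind this example can be reproduced with a small Python model. This is a sketch under stated assumptions, not TI code: the function names are illustrative, the real part is assumed to sit in the upper halfword of each input word and in the odd register of each result pair (an ordering inferred from the example values, not stated in the text), and saturation is not modeled. The `conjugate` flag conjugates the second operand, as described for DCCMPY in section 4.73.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def cmpy_model(src1, src2, conjugate=False):
    """One packed 16-bit complex multiply; returns (real, imag) as 32-bit words."""
    a, b = _s16(src1 >> 16), _s16(src1)  # assumed layout: real high, imag low
    c, d = _s16(src2 >> 16), _s16(src2)
    if conjugate:
        d = -d  # conjugate the second operand
    real = a * c - b * d
    imag = a * d + b * c
    return real & 0xFFFFFFFF, imag & 0xFFFFFFFF  # no saturation modeled


def dcmpy_model(op1, op2, conjugate=False):
    """Two complex multiplies; returns [dst_3, dst_2, dst_1, dst_0]."""
    hi = cmpy_model(op1 >> 32, op2 >> 32, conjugate)
    lo = cmpy_model(op1 & 0xFFFFFFFF, op2 & 0xFFFFFFFF, conjugate)
    return [hi[0], hi[1], lo[0], lo[1]]
```

Here (9 - 2j)(-1 + 7j) = 5 + 65j, which gives 0x00000005 and 0x00000041 as in the example. With `conjugate=True` the product is -23 - 61j, matching the 0xFFFFFFE9 and 0xFFFFFFC3 values of the DCCMPY example.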
4.80 DCMPYR1
2-Way SIMD Complex Multiply With Rounding, Packed Complex 16-bit (16-bit
Real/16-bit Imaginary)
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 2 1 0
5 5 5 5 4
Description This instruction performs two complex multiplies on two pairs of packed complex
numbers.
Delay Slots 3
Example A3 == 0x00000000
A2 == 0x00000000
A1 == 0x00000000
A0 == 0x00000000
DCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x00000000
A3 == 0x80008000
A2 == 0x80008000
A1 == 0x80008000
A0 == 0x80008000
DCMPYR1 .M A3:A2,A1:A0,A15:A14 ; added for a & b_path cov.
A15 == 0x00007fff
A14 == 0x00007fff
A3 == 0x08000400
A2 == 0x09000200
A1 == 0x09000200
A0 == 0x08000400
DCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00800068
A14 == 0x00800068
A3 == 0x7FFF7FFF
A2 == 0x7FFF8000
A1 == 0x7FFF8000
A0 == 0x7FFF7FFF
DCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x7fffffff
A14 == 0x7fffffff
A3 == 0xFFFFFFFF
A2 == 0xFFFFFFFF
A1 == 0xFFFFFFFF
A0 == 0xFFFFFFFF
DCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x00000000
A14 == 0x00000000
A3 == 0x55555555
A2 == 0x55555555
A1 == 0x55555555
A0 == 0x55555555
DCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x000071c6
A14 == 0x000071c6
A3 == 0x01234567
A2 == 0x89ABCDEF
A1 == 0x89ABCDEF
A0 == 0x01234567
DCMPYR1 .M A3:A2,A1:A0,A15:A14
A15 == 0x1a18bf65
A14 == 0x1a18bf65
4.81 DCROT270
2-Way SIMD Rotate Complex Number By 270 Degrees, Packed Complex 16-bit (16-bit
Real/16-bit Imaginary)
31 29 28 27 23 22 18 17 13 12 11 2 1 0
3 5 5 5 10
Description Performs two rotate-by-270-degree operations on the input vector of complex numbers.
This is equivalent to executing the CROT270 instruction twice, once on each word of the
register pair, as in:
CROT270 A1, A3
|| CROT270 A0, A2
Execution if(cond) {
else nop
Delay Slots 0
Example A1 == 0x12345678
A0 == 0x80005532
DCROT270 .L A1:A0,A15:A14
A15 == 0x5678EDCC
A14 == 0x55327fff
4.82 DCROT90
2-Way SIMD Rotate Complex Number By 90 Degrees, Packed Complex 16-bit (16-bit
Real/16-bit Imaginary)
31 29 28 27 23 22 18 17 13 12 11 2 1 0
3 5 5 5 10
Description Performs two rotate-by-90-degree operations on the input vector of complex numbers.
This is equivalent to executing the CROT90 instruction twice, once on each word of the
register pair, as in:
CROT90 A1, A3
|| CROT90 A0, A2
Execution if(cond) {
else nop
Delay Slots 0
Example A1 == 0x12345678
A0 == 0x55328000
DCROT90 .L A1:A0,A15:A14
A15 == 0xA9881234
A14 == 0x7fff5532
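The rotation examples for DCROT90 and DCROT270 can be reproduced with the Python sketch below. It is an illustration, not TI code: the function names are assumed, the real part is taken to be the upper halfword of each word, and the negated component is saturated to 16 bits. Both of these choices are inferred from the example values (0x8000 negates to 0x7fff), not from an explicit statement in the text.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def _sat16(x):
    """Clamp x to the signed 16-bit range."""
    return max(-0x8000, min(0x7FFF, x))


def crot_model(word, degrees):
    """Rotate one packed complex number (real high, imag low) by 90 or 270 degrees."""
    re, im = _s16(word >> 16), _s16(word)
    if degrees == 90:      # multiply by +j: (re, im) -> (-im, re)
        re, im = _sat16(-im), re
    elif degrees == 270:   # multiply by -j: (re, im) -> (im, -re)
        re, im = im, _sat16(-re)
    else:
        raise ValueError("degrees must be 90 or 270")
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)


def dcrot_model(pair, degrees):
    """Apply crot_model to both 32-bit words of a 64-bit register pair."""
    hi = crot_model(pair >> 32, degrees)
    lo = crot_model(pair & 0xFFFFFFFF, degrees)
    return (hi << 32) | lo
```

For the example above, `dcrot_model(0x1234567855328000, 90)` returns `0xA98812347FFF5532`.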
4.83 DDOTP4
Double Dot Product, Signed, Packed 16-Bit and Signed, Packed 8-Bit
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 0 0 1 1 0 0 s p
5 1 1 1
The lower byte of the lower halfword of src2 is sign-extended to 16 bits and multiplied
by the lower halfword of src1. The upper byte of the lower halfword of src2 is
sign-extended to 16 bits and multiplied by the upper halfword of src1. The two
products are added together and the result is then written to dst_e.
The lower byte of the upper halfword of src2 is sign-extended to 16 bits and multiplied
by the lower halfword of src1. The upper byte of the upper halfword of src2 is
sign-extended to 16 bits and multiplied by the upper halfword of src1. The two
products are added together and the result is then written to dst_o.
dst_o dst_e
d1 x c3 + d0 x c2 d1 x c1 + d0 x c0
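The two sums in the diagram can be written out as a short Python model. This is a host-side sketch, not TI code; `ddotp4_model` is an assumed name, d1:d0 are taken to be the signed halfwords of src1, and c3..c0 the signed bytes of src2, following the description above.

```python
def _sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def ddotp4_model(src1, src2):
    """Return (dst_o, dst_e) as 32-bit words."""
    d1, d0 = _sext(src1 >> 16, 16), _sext(src1, 16)
    c = [_sext(src2 >> (8 * i), 8) for i in range(4)]  # c[0] is the lowest byte
    dst_e = d1 * c[1] + d0 * c[0]  # bytes of the lower halfword of src2
    dst_o = d1 * c[3] + d0 * c[2]  # bytes of the upper halfword of src2
    return dst_o & 0xFFFFFFFF, dst_e & 0xFFFFFFFF
```

As an illustrative input (not taken from the manual), src1 = 0x00020003 and src2 = 0x01FF0204 give dst_e = 2*2 + 3*4 = 16 and dst_o = 2*1 + 3*(-1) = -1.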
Examples Example 1
Example 2
DDOTP4 .M1X A4,B5,A9:A8
4.84 DDOTP4H
2-Way SIMD Dot Product, Signed by Signed Packed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 2 1 0
5 5 5 5 4
Description The DDOTP4H instruction returns two dot-products between four sets of packed
16-bit values. The values in qwop1 and qwop2 are treated as signed packed 16-bit
quantities.
For each pair of 16-bit quantities in the low 2 words of qwop1 and qwop2, the signed
16-bit value from qwop1 is multiplied with the signed 16-bit value from qwop2. The
four products are summed together, and the resulting dot product is written to the low
32-bits of dwdst.
And for each pair of 16-bit quantities in the high 2 words of qwop1 and qwop2, the
signed 16-bit value from qwop1 is multiplied with the signed 16-bit value from qwop2.
The four products are summed together, and the resulting dot product is written to the
high 32-bits of dwdst.
The result of each dot product is saturated to 32 bits, and the saturation bits are set in
CSR and SSR when saturation occurs.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
dotp4h dotp4h
b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 ←qwop2
= =
a_7 * b_7 + a_6 * b_6 + a_5 * b_5 + a_4 * b_4 a_3 * b_3 + a_2 * b_2 + a_1 * b_1 + a_0 * b_0 ←dwdst
Execution TBD
See Also
Example A3 == 0x321089AB
A2 == 0x321089AB
A1 == 0x321089AB
A0 == 0x321089AB
A7 == 0x87654321
A6 == 0x87654321
A5 == 0x87654321
A4 == 0x87654321
DDOTP4H .M A3:A2:A1:A0,A7:A6:A5:A4,A15:A14
A15 <== 0x92c560b6
A14 <== 0x92c560b6
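The example result can be reproduced with the Python sketch below. It is an illustration, not TI code; `ddotp4h_model` is an assumed name, each 128-bit register quad is passed as a single integer (A3 in the most-significant word), and the CSR and SSR saturation flags are not modeled.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def _sat32(x):
    """Clamp x to the signed 32-bit range."""
    return max(-0x80000000, min(0x7FFFFFFF, x))


def ddotp4h_model(qwop1, qwop2):
    """Two 4-element signed 16-bit dot products; returns the 64-bit dwdst."""
    def dot(first_lane):
        total = 0
        for lane in range(first_lane, first_lane + 4):
            total += _s16(qwop1 >> (16 * lane)) * _s16(qwop2 >> (16 * lane))
        return _sat32(total) & 0xFFFFFFFF

    # Lanes 0..3 (low two words) feed the low result, lanes 4..7 the high result
    return (dot(4) << 32) | dot(0)
```

With every word of qwop1 equal to 0x321089AB and every word of qwop2 equal to 0x87654321, each dot product is 2 * (12816 * -30875 + -30293 * 17185) = -1832558410, which is 0x92c560b6 as a 32-bit word.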
4.85 DDOTPH2
Double Dot Product, Two Pairs, Signed, Packed 16-Bit
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 1 1 1 1 0 0 s p
5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are written to a 64-bit register pair.
The product of the lower halfwords of src1_o and src2 is added to the product of the
upper halfwords of src1_o and src2. The result is then written to dst_o.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result
is then written to dst_e.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written
one cycle after the results are written to dst_o:dst_e.
32 32
dst_o dst_e
d3 x c1 + d2 x c0 d2 x c1 + d1 x c0
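The two sums shown above can be sketched in Python. This is a host-side illustration, not TI code; `ddotph2_model` is an assumed name, d3:d2 are the halfwords of src1_o, d1 is the upper halfword of src1_e, and c1:c0 are the halfwords of src2. Saturation to the signed 32-bit range is assumed from the statement that either result may saturate; the SSR and CSR flag updates are not modeled.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def _sat32(x):
    """Clamp x to the signed 32-bit range."""
    return max(-0x80000000, min(0x7FFFFFFF, x))


def ddotph2_model(src1_o, src1_e, src2):
    """Return (dst_o, dst_e) as 32-bit words."""
    d3, d2 = _s16(src1_o >> 16), _s16(src1_o)
    d1 = _s16(src1_e >> 16)           # d0 is not used by DDOTPH2
    c1, c0 = _s16(src2 >> 16), _s16(src2)
    dst_o = _sat32(d3 * c1 + d2 * c0)
    dst_e = _sat32(d2 * c1 + d1 * c0)
    return dst_o & 0xFFFFFFFF, dst_e & 0xFFFFFFFF
```

As an illustrative input (not taken from the manual), d3..d0 = 1, 2, 3, 4 and c1:c0 = 5, 6 give dst_o = 1*5 + 2*6 = 17 and dst_e = 2*5 + 3*6 = 28. When all halfwords are 0x8000, each sum is 2^31, which saturates to 0x7FFFFFFF.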
Delay Slots 3
Examples Example 1
DDOTPH2 .M1 A5:A4,A6,A9:A8
Example 2
A6 8000 8000h
Example 3
A6 340B F73Bh
4.86 DDOTPH2R
Double Dot Product With Rounding, Two Pairs, Signed, Packed 16-Bit
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 1 1 1 0 0 s p
5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are rounded, shifted right by 16 and packed into a 32-bit register.
The product of the lower halfwords of src1_o and src2 is added to the product of the
upper halfwords of src1_o and src2. The result is rounded by adding 2^15 to it and
saturated if appropriate. The 16 most-significant bits of the result are written to the 16
most-significant bits of dst.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result
is rounded by adding 2^15 to it and saturated if appropriate. The 16 most-significant bits
of the result are written to the 16 least-significant bits of dst.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written
one cycle after the results are written to dst.
msb16(sat((lsb16(src1_o) × msb16(src2)) +
(msb16(src1_e) × lsb16(src2)) + 00008000h)) → lsb16(dst)
Delay Slots 3
See Also DDOTPH2, DDOTPL2, DDOTPL2R
Examples Example 1
A5 BBAE D169h
A6 340B F73Bh
Example 2
DDOTPH2R .M1 A5:A4,A6,A8
A5 1234 8000h
A6 8000 8001h
Example 3
B5 8000 8000h
B6 8000 8001h
4.87 DDOTPL2
Double Dot Product, Two Pairs, Signed, Packed 16-Bit
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 1 0 1 1 0 0 s p
5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are written to a 64-bit register pair.
The product of the lower halfwords of src1_e and src2 is added to the product of the
upper halfwords of src1_e and src2. The result is then written to dst_e.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result
is then written to dst_o.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written
one cycle after the results are written to dst_o:dst_e.
src1_o src1_e src2
d3 d2 d1 d0 c1 c0
MSB16 LSB16 MSB16 LSB16 MSB16 LSB16
32 32
dst_o dst_e
d2 x c1 + d1 x c0 d1 x c1 + d0 x c0
Delay Slots 3
Examples Example 1
Example 2
A6 340B F73Bh
Example 3
A6 8000 8000h
4.88 DDOTPL2R
Double Dot Product With Rounding, Two Pairs, Signed Packed 16-Bit
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 0 1 1 0 0 s p
5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are rounded, shifted right by 16 and packed into a 32-bit register.
The product of the lower halfwords of src1_e and src2 is added to the product of the
upper halfwords of src1_e and src2. The result is rounded by adding 2^15 to it and
saturated if appropriate. The 16 most-significant bits of the result are written to the 16
least-significant bits of dst.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result
is rounded by adding 2^15 to it and saturated if appropriate. The 16 most-significant bits
of the result are written to the 16 most-significant bits of dst.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written
one cycle after the results are written to dst.
msb16(sat((lsb16(src1_o) × msb16(src2)) +
(msb16(src1_e) × lsb16(src2)) + 00008000h)) → msb16(dst)
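The round, saturate, and pack sequence of DDOTPL2R can be sketched in Python. This is an illustration, not TI code; `ddotpl2r_model` is an assumed name, saturation is applied to the signed 32-bit range after the 00008000h rounding term is added (following the `sat(...)` expression above), and the SSR and CSR flag updates are not modeled.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def _sat32(x):
    """Clamp x to the signed 32-bit range."""
    return max(-0x80000000, min(0x7FFFFFFF, x))


def ddotpl2r_model(src1_o, src1_e, src2):
    """Return the packed 32-bit dst."""
    d2 = _s16(src1_o)                         # lower halfword of src1_o
    d1, d0 = _s16(src1_e >> 16), _s16(src1_e)
    c1, c0 = _s16(src2 >> 16), _s16(src2)
    low = _sat32(d1 * c1 + d0 * c0 + 0x8000)  # goes to lsb16(dst)
    high = _sat32(d2 * c1 + d1 * c0 + 0x8000) # goes to msb16(dst)
    # Keep the 16 most-significant bits of each rounded 32-bit sum
    return (((high >> 16) & 0xFFFF) << 16) | ((low >> 16) & 0xFFFF)
```

As an illustrative input (not taken from the manual), d2 = d1 = d0 = 0x4000 with c1 = c0 = 0x4000 gives 0x20008000 for each sum, so dst is 0x20002000.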
Delay Slots 3
Examples Example 1
A5 BBAE D169h
A6 340B F73Bh
Example 2
A5 1234 8000h
A6 8000 8001h
Example 3
B5 8000 8000h
B6 8000 8001h
4.89 DDOTPSU4H
2-Way SIMD Dot Product, Signed By Unsigned Packed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description The DDOTPSU4H instruction returns two dot-products between four sets of packed
16-bit values.
The values in qwop1 are treated as signed packed 16-bit quantities, while the values in
qwop2 are treated as unsigned 16-bit quantities.
For each pair of 16-bit quantities in the low 2 words of qwop1 and qwop2, the signed
16-bit value from qwop1 is multiplied with the unsigned 16-bit value from qwop2. The
four products are summed together, and the resulting dot product is written to the low
32-bits of dwdst.
And for each pair of 16-bit quantities in the high 2 words of qwop1 and qwop2, the
signed 16-bit value from qwop1 is multiplied with the unsigned 16-bit value from
qwop2. The four products are summed together, and the resulting dot product is
written to the high 32-bits of dwdst.
The result of each dot product is saturated to 32 bits, and the saturation bits are set in
CSR and SSR when saturation occurs.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0 ←qwop1
dotpsu4h dotpsu4h
a_7 * b_7 + a_6 * b_6 + a_5 * b_5 + a_4 * b_4 a_3 * b_3 + a_2 * b_2 + a_1 * b_1 + a_0 * b_0 ←dwdst
Execution TBD
Delay Slots 3
CSR<== 0x10010200
CSR == 0x00000000
4.90 DEAL
Deinterleave and Pack
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 1 x 0 0 0 0 1 1 1 1 0 0 s p
1 1 1
Description Performs a deinterleave and pack operation on the bits in src2. The odd and even bits
of src2 are extracted into two separate, 16-bit quantities. These 16-bit quantities are
then packed such that the even bits are placed in the lower halfword, and the odd bits
are placed in the upper halfword.
As a result, bits 0, 2, 4, ... , 28, 30 of src2 are placed in bits 0, 1, 2, ... , 14, 15 of dst.
Likewise, bits 1, 3, 5, ... , 29, 31 of src2 are placed in bits 16, 17, 18, ... , 30, 31 of dst.
31 0
aAbBcCdD eEfFgGhHiIjJkKlLmMnNoOpP ←src2
DEAL
↓ ↓
31 0
abcdefgh ijklmnop ABCDEFGH IJKLMNOP ←dst
src2[30,28,26...0] → dst[15,14,13...0]
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
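The bit movement performed by DEAL can be sketched in Python. This is a host-side illustration of the mapping given in the description, not TI code; `deal_model` is an assumed name.

```python
def deal_model(src2):
    """Deinterleave a 32-bit word: even bits to the lower halfword,
    odd bits to the upper halfword."""
    even = 0
    odd = 0
    for i in range(16):
        even |= ((src2 >> (2 * i)) & 1) << i      # bits 0, 2, ..., 30
        odd |= ((src2 >> (2 * i + 1)) & 1) << i   # bits 1, 3, ..., 31
    return (odd << 16) | even
```

For instance, 0x55555555 has only even bits set, so the result is 0x0000FFFF; 0xAAAAAAAA has only odd bits set, so the result is 0xFFFF0000.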
4.91 DINT
Disable Interrupts and Save Previous Enable State
Syntax DINT
unit = none
Compatibility
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description Disables interrupts in the current cycle, copies the contents of the GIE bit in TSR into
the SGIE bit in TSR, and clears the GIE bit in both TSR and CSR. The PGIE bit in CSR
is unchanged.
The CPU will not service a maskable interrupt in the cycle immediately following the
DINT instruction. This behavior differs from writes to GIE using the MVC instruction.
See section 5.2 for details.
The DINT instruction cannot be placed in parallel with the following instructions:
MVC reg, TSR; MVC reg, CSR; B IRP; B NRP; NOP n; RINT; SPKERNEL;
SPKERNELR; SPLOOP; SPLOOPD; SPLOOPW; SPMASK; or SPMASKR.
Note—The use of the DINT and RINT instructions in a nested manner, like the
following code:
DINT
DINT
RINT
RINT
leaves interrupts disabled. The first DINT leaves TSR.GIE cleared to 0, so the
second DINT leaves TSR.SGIE cleared to 0. The RINT instructions, therefore,
copy zero to TSR.GIE (leaving interrupts disabled).
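The effect described in the note can be traced with a small Python state model. This is a sketch, not TI code: the class `TsrModel` is hypothetical and tracks only the GIE and SGIE bits, and RINT is modeled solely as the copy of SGIE into GIE that the note describes. Any other side effects of either instruction are not modeled.

```python
class TsrModel:
    """Tracks only the GIE and SGIE bits of TSR (hypothetical helper)."""

    def __init__(self, gie=1, sgie=0):
        self.gie = gie
        self.sgie = sgie

    def dint(self):
        # Save the current enable state, then disable interrupts
        self.sgie = self.gie
        self.gie = 0

    def rint(self):
        # Restore the saved enable state (assumption: copy only)
        self.gie = self.sgie


def run(sequence, gie=1):
    """Apply a sequence of 'DINT'/'RINT' mnemonics and return the final GIE."""
    tsr = TsrModel(gie=gie)
    for op in sequence:
        if op == "DINT":
            tsr.dint()
        elif op == "RINT":
            tsr.rint()
        else:
            raise ValueError(op)
    return tsr.gie
```

Starting with interrupts enabled, `run(["DINT", "RINT"])` returns 1, while the nested sequence `run(["DINT", "DINT", "RINT", "RINT"])` returns 0 because the second DINT overwrites the saved state with 0.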
Execution Disable interrupts in current cycle
Delay Slots 0
4.92 DINTHSP
2-Way SIMD Convert 16-bit Signed Integer to Single Precision Floating Point
Syntax
Mnemonic Unit Operand
DINTHSP L,S1 or L,S2 xop,dwdst
31 29 28 27 23 22 18 17 13 12 11 2 1 0
3 5 5 5 10
31 29 28 27 23 22 18 17 13 12 11 2 1 0
3 5 5 5 10
Description The signed, packed 16-bit values in src2 are converted to single-precision
floating-point values and placed in dst_e and dst_o.
Execution if(cond) {
else nop
Delay Slots 2
FADCR= 0x00000000
;RMODE(0);[6501]->[6501];[4391]->[4391];RMODE(0);
FADCR== 0x00000000
A2 == 0xffffffde
DINTHSP .L A2,A1:A0
A1 == 0xbf800000
A0 == 0xc2080000
FADCR== 0x00000000
;RMODE(0);[-1]->[-1];[-34]->[-34];RMODE(0)
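The conversion in this example can be reproduced in Python using the standard `struct` module to obtain IEEE 754 single-precision bit patterns. This is a sketch, not TI code; `dinthsp_model` and its `signed` flag are assumed names. The upper halfword is taken to map to the odd destination register and the lower halfword to the even one, as the example values indicate. Every 16-bit integer is exactly representable in single precision, so the rounding mode does not affect the result. With `signed=False` the halfwords are treated as unsigned, as for DINTHSPU in section 4.94.

```python
import struct


def _f32_bits(value):
    """Return the IEEE 754 single-precision bit pattern of value."""
    return struct.unpack(">I", struct.pack(">f", float(value)))[0]


def dinthsp_model(src2, signed=True):
    """Convert both 16-bit halfwords of src2; returns (dst_o, dst_e)."""
    lo = src2 & 0xFFFF
    hi = (src2 >> 16) & 0xFFFF
    if signed:
        lo = lo - 0x10000 if lo & 0x8000 else lo
        hi = hi - 0x10000 if hi & 0x8000 else hi
    return _f32_bits(hi), _f32_bits(lo)
```

For A2 = 0xffffffde, the halfwords are -1 and -34, so `dinthsp_model(0xffffffde)` returns `(0xbf800000, 0xc2080000)`, matching A1 and A0 above.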
4.93 DINTSP
2-Way SIMD Convert 32-bit Signed Integer to Single Precision Floating Point, Packed
Signed 32-bit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src opfield x 1 1 1 1 0 0 1 0 0 0 s p
3 5 5 5
Description The signed, packed 32-bit values in src2 are converted to single-precision
floating-point values and placed in dst_e and dst_o.
Execution if(cond) {
sp(src2_e) -> dst_e
sp(src2_o) -> dst_o
}
else nop
Delay Slots 2
Example
4.94 DINTHSPU
2-Way SIMD Convert 16-bit Unsigned Integer to Single Precision Floating Point
31 29 28 27 23 22 18 17 13 12 11 2 1 0
3 5 5 5 10
31 29 28 27 23 22 18 17 13 12 11 2 1 0
Description The unsigned, packed 16-bit values in src2 are converted to single-precision
floating-point values and placed in dst_o and dst_e.
Execution if(cond) {
sp(ulsb16(src2)) -> dst_e
sp(umsb16(src2)) -> dst_o
}
else nop
Delay Slots 2
FADCR= 0x00000000
;RMODE(0);[6501]->[6501];[4391]->[4391];RMODE(0);
FADCR== 0x00000000
A2 == 0xffffffde
DINTHSPU .L A2,A1:A0
A1 == 0x477FFF00
A0 == 0x477FDE00
FADCR= 0x00000000
;RMODE(0);[2^16-1]->[2^16-1];[2^16-34]->[2^16-34];RMODE(0);
4.95 DINTSPU
2-Way SIMD Convert 32-bit Unsigned Integer to Single Precision Floating Point,
Packed Unsigned 32-bit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src opfield x 1 1 1 1 0 0 1 0 0 0 s p
3 5 5 5
Description The unsigned, packed 32-bit values in src2 are converted to single-precision
floating-point values and placed in dst_e and dst_o.
Execution if(cond) {
sp(src2_e) -> dst_e
sp(src2_o) -> dst_o
}
else nop
Delay Slots 2
Example
4.96 DMAX2
2-Way SIMD Maximum, Packed Signed 16-bit
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
Description The DMAX2 instruction performs four maximum operations on packed signed 16-bit
values. For each pair of signed 16-bit values in dwop1 and xdwop2, DMAX2 places the larger
value in the corresponding position in dwdst.
Delay Slots 0
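The lane-wise behavior described above can be sketched in C as follows. This is a host-side model under stated assumptions, not TI code: the function names are hypothetical and each 64-bit register pair is passed as a single 64-bit word (odd register in the upper half). The self-test values are taken from the first two examples in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DMAX2: signed maximum over four 16-bit lanes.
 * Assumes two's-complement narrowing when reinterpreting a lane as int16_t. */
uint64_t dmax2_model(uint64_t dwop1, uint64_t xdwop2)
{
    uint64_t dwdst = 0;
    for (int lane = 0; lane < 4; lane++) {
        int16_t a = (int16_t)(uint16_t)(dwop1  >> (16 * lane));
        int16_t b = (int16_t)(uint16_t)(xdwop2 >> (16 * lane));
        int16_t m = (a > b) ? a : b;            /* signed comparison */
        dwdst |= (uint64_t)(uint16_t)m << (16 * lane);
    }
    return dwdst;
}

/* A1:A0 and A3:A2 values from the first two examples in this section. */
void dmax2_selftest(void)
{
    assert(dmax2_model(0x80007fff80007fffULL, 0x0000800000008000ULL)
           == 0x00007fff00007fffULL);
    assert(dmax2_model(0x1111800111118001ULL, 0x2222800322228003ULL)
           == 0x2222800322228003ULL);
}
```

Note that 0x8000 is the most negative 16-bit value, so it loses against 0x0000 in the first case.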
Example A1 == 0x80007fff
A0 == 0x80007fff
A3 == 0x00008000
A2 == 0x00008000
DMAX2 .L A1:A0,A3:A2,A5:A4
A5 == 0x00007fff
A4 == 0x00007fff
A1 == 0x11118001
A0 == 0x11118001
A3 == 0x22228003
A2 == 0x22228003
DMAX2 .L A1:A0,A3:A2,A5:A4
A5 == 0x22228003
A4 == 0x22228003
A1 == 0xffffffff
A0 == 0xffffffff
A3 == 0x7fffffff
A2 == 0x7fffffff
DMAX2 .L A1:A0,A3:A2,A5:A4
A5 == 0x7fffffff
A4 == 0x7fffffff
A1 == 0x7ffffffe
A0 == 0x7ffffffe
A3 == 0x8000ffff
A2 == 0x8000ffff
DMAX2 .L A1:A0,A3:A2,A5:A4
A5 == 0x7fffffff
A4 == 0x7fffffff
A1 == 0x7fffffff
A0 == 0x7fffffff
A3 == 0x7ffffffe
A2 == 0x7ffffffe
DMAX2 .L A1:A0,A3:A2,A5:A4
A5 == 0x7fffffff
A4 == 0x7fffffff
A1 == 0xfffeffff
A0 == 0xfffeffff
A3 == 0x7fff7ffe
A2 == 0x7fff7ffe
DMAX2 .L A1:A0,A3:A2,A5:A4
A5 == 0x7fff7ffe
A4 == 0x7fff7ffe
A1 == 0x43211234
A0 == 0x43211234
A3 == 0x23411324
A2 == 0x23411324
DMAX2 .L A1:A0,A3:A2,A5:A4
A5 == 0x43211324
A4 == 0x43211324
4.97 DMAXU4
4-Way SIMD Maximum, Packed Unsigned 8-bit
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
Description The DMAXU4 instruction performs eight maximum operations on packed unsigned
8-bit values. For each pair of unsigned 8-bit values in dwop1 and xdwop2, DMAXU4 places the
larger value in the corresponding position in dwdst.
Delay Slots 0
Example A3 == 0xbabebeef
A2 == 0xbabebeef
A1 == 0xbeefbabe
A0 == 0xbeefbabe
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0xbeefbeef
A4 == 0xbeefbeef
A3 == 0x11223344
A2 == 0x11223344
A1 == 0x44332211
A0 == 0x44332211
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0x44333344
A4 == 0x44333344
A3 == 0x807f0040
A2 == 0x807f0040
A1 == 0x7f801180
A0 == 0x7f801180
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0x80801180
A4 == 0x80801180
A3 == 0xfffefffe
A2 == 0xfffefffe
A1 == 0xfefffeff
A0 == 0xfefffeff
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0xffffffff
A4 == 0xffffffff
A3 == 0x00000000
A2 == 0x00000000
A1 == 0xffffffff
A0 == 0xffffffff
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0xffffffff
A4 == 0xffffffff
A3 == 0x00ff00ff
A2 == 0x00ff00ff
A1 == 0xff00ff00
A0 == 0xff00ff00
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0xffffffff
A4 == 0xffffffff
A3 == 0xabcdefad
A2 == 0xabcdefad
A1 == 0xbadcfeda
A0 == 0xbadcfeda
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0xbadcfeda
A4 == 0xbadcfeda
A3 == 0x43211234
A2 == 0x43211234
A1 == 0x23411324
A0 == 0x23411324
DMAXU4 .L A3:A2,A1:A0,A5:A4
A5 == 0x43411334
A4 == 0x43411334
4.98 DMIN2
2-Way SIMD Minimum, Packed Signed 16-bit
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
Description The DMIN2 instruction performs four minimum operations on packed signed 16-bit values. For
each pair of signed 16-bit values in dwop1 and xdwop2, DMIN2 places the smaller value
in the corresponding position in dwdst.
Delay Slots 0
Example A3 == 0x80007fff
A2 == 0x80007fff
A1 == 0x00008000
A0 == 0x00008000
DMIN2 .L A3:A2,A1:A0,A5:A4
A5 == 0x80008000
A4 == 0x80008000
A3 == 0x11118001
A2 == 0x11118001
A1 == 0x22228003
A0 == 0x22228003
DMIN2 .L A3:A2,A1:A0,A5:A4
A5 == 0x11118001
A4 == 0x11118001
A3 == 0xffffffff
A2 == 0xffffffff
A1 == 0x7fffffff
A0 == 0x7fffffff
DMIN2 .L A3:A2,A1:A0,A5:A4
A5 == 0xffffffff
A4 == 0xffffffff
A3 == 0x7ffffffe
A2 == 0x7ffffffe
A1 == 0x8000ffff
A0 == 0x8000ffff
DMIN2 .L A3:A2,A1:A0,A5:A4
A5 == 0x8000fffe
A4 == 0x8000fffe
A3 == 0x7fffffff
A2 == 0x7fffffff
A1 == 0x7ffffffe
A0 == 0x7ffffffe
DMIN2 .L A3:A2,A1:A0,A5:A4
A5 == 0x7ffffffe
A4 == 0x7ffffffe
A3 == 0xfffeffff
A2 == 0xfffeffff
A1 == 0x7fff7ffe
A0 == 0x7fff7ffe
DMIN2 .L A3:A2,A1:A0,A5:A4
A5 == 0xfffeffff
A4 == 0xfffeffff
A3 == 0x43211234
A2 == 0x43211234
A1 == 0x23411324
A0 == 0x23411324
DMIN2 .L A3:A2,A1:A0,A5:A4
A5 == 0x23411234
A4 == 0x23411234
4.99 DMINU4
4-Way SIMD Minimum, Packed Unsigned 8-bit
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
Description The DMINU4 instruction performs eight minimum operations on packed unsigned 8-bit values.
For each pair of unsigned 8-bit values in dwop1 and xdwop2, DMINU4 places the
smaller value in the corresponding position in dwdst.
Delay Slots 0
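The byte-wise unsigned minimum described above can be sketched in C as follows. This is a host-side model, not TI code: the function names are hypothetical and each register pair is passed as one 64-bit word. The self-test values are taken from examples in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DMINU4: unsigned minimum over eight 8-bit lanes. */
uint64_t dminu4_model(uint64_t dwop1, uint64_t xdwop2)
{
    uint64_t dwdst = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t a = (uint8_t)(dwop1  >> (8 * lane));
        uint8_t b = (uint8_t)(xdwop2 >> (8 * lane));
        uint8_t m = (a < b) ? a : b;            /* unsigned comparison */
        dwdst |= (uint64_t)m << (8 * lane);
    }
    return dwdst;
}

/* A1:A0 and A3:A2 values from two of the examples in this section. */
void dminu4_selftest(void)
{
    assert(dminu4_model(0x4321123443211234ULL, 0x2341132423411324ULL)
           == 0x2321122423211224ULL);
    assert(dminu4_model(0x3344556633445566ULL, 0x7766554477665544ULL)
           == 0x3344554433445544ULL);
}
```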
Example A1 == 0xfffefffe
A0 == 0xfffefffe
A3 == 0xfefffeff
A2 == 0xfefffeff
DMINU4 .L A1:A0,A3:A2,A9:A8
A9 == 0xfefefefe
A8 == 0xfefefefe
A1 == 0x00000000
A0 == 0x00000000
A3 == 0xffffffff
A2 == 0xffffffff
DMINU4 .L A1:A0,A3:A2,A9:A8
A9 == 0x00000000
A8 == 0x00000000
A1 == 0x00ff00ff
A0 == 0x00ff00ff
A3 == 0xff00ff00
A2 == 0xff00ff00
DMINU4 .L A1:A0,A3:A2,A9:A8
A9 == 0x00000000
A8 == 0x00000000
A1 == 0xabcdefad
A0 == 0xabcdefad
A3 == 0xbadcfeda
A2 == 0xbadcfeda
DMINU4 .L A1:A0,A3:A2,A9:A8
A9 == 0xabcdefad
A8 == 0xabcdefad
A1 == 0x43211234
A0 == 0x43211234
A3 == 0x23411324
A2 == 0x23411324
DMINU4 .L A1:A0,A3:A2,A9:A8
A9 == 0x23211224
A8 == 0x23211224
A1 == 0x77665544
A0 == 0x77665544
A3 == 0x11223344
A2 == 0x11223344
DMINU4 .L A1:A0,A3:A2,A9:A8
A9 == 0x11223344
A8 == 0x11223344
A1 == 0x33445566
A0 == 0x33445566
A3 == 0x77665544
A2 == 0x77665544
DMINU4 .L A1:A0,A3:A2,A9:A8
A9 == 0x33445544
A8 == 0x33445544
4.100 DMPY2
4-Way SIMD Multiply, Packed Signed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 2 1 0
5 5 5 5 4
Description The DMPY2 instruction performs four 16-bit multiplications between signed packed
16-bit quantities. The values in dwop1 and xdwop2 are treated as signed packed 16-bit
quantities. The 32-bit results are placed in a 128-bit register quad.
Delay Slots 3
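The four signed multiplies described above can be sketched in C as follows. This is a host-side model, not TI code: the function names are hypothetical, each source register pair is passed as one 64-bit word, and the 128-bit register quad is modeled as an array of four 32-bit words with dst[3] as the most significant register. The self-test values come from the first two examples in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DMPY2: four signed 16x16 -> 32-bit multiplies.
 * dst[0] is the least significant word of the register quad.
 * Assumes two's-complement narrowing when reinterpreting a lane as int16_t. */
void dmpy2_model(uint64_t dwop1, uint64_t xdwop2, uint32_t dst[4])
{
    for (int lane = 0; lane < 4; lane++) {
        int16_t a = (int16_t)(uint16_t)(dwop1  >> (16 * lane));
        int16_t b = (int16_t)(uint16_t)(xdwop2 >> (16 * lane));
        /* The product magnitude is at most 2^30, so it fits in int32_t. */
        dst[lane] = (uint32_t)((int32_t)a * (int32_t)b);
    }
}

void dmpy2_selftest(void)
{
    uint32_t q[4];

    /* A5:A4 = 0x12343497:0x6a321193, A9:A8 = 0x21ff50a7:0xb1746ca4 */
    dmpy2_model(0x123434976a321193ULL, 0x21ff50a7b1746ca4ULL, q);
    assert(q[3] == 0x026ad5ccu && q[2] == 0x10917e81u);
    assert(q[1] == 0xdf6ab0a8u && q[0] == 0x0775462cu);

    /* 0x7fff * 0x7fff and 0x8001 * 0x8001 both give 0x3fff0001. */
    dmpy2_model(0x7fff7fff80018001ULL, 0x7fff7fff80018001ULL, q);
    assert(q[3] == 0x3fff0001u && q[0] == 0x3fff0001u);
}
```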
Example A5 == 0x12343497
A4 == 0x6a321193
A9 == 0x21ff50a7
A8 == 0xb1746ca4
DMPY2 .M A5:A4,A9:A8,A3:A2:A1:A0
A3 == 0x026ad5cc
A2 == 0x10917e81
A1 == 0xdf6ab0a8
A0 == 0x0775462c
A5 == 0x7fff7fff
A4 == 0x80018001
A9 == 0x7fff7fff
A8 == 0x80018001
DMPY2 .M A5:A4,A9:A8,A3:A2:A1:A0
A3 == 0x3fff0001
A2 == 0x3fff0001
A1 == 0x3fff0001
A0 == 0x3fff0001
A5 == 0x7fff8001
A4 == 0x3ccdc333
A9 == 0x80017fff
A8 == 0xc333c333
DMPY2 .M A5:A4,A9:A8,A3:A2:A1:A0
A3 == 0xc000ffff
A2 == 0xc000ffff
A1 == 0xf18f43d7
A0 == 0x0e70bc29
A5 == 0x87654321
A4 == 0x80008000
A9 == 0x321089ab
A8 == 0x80008000
DMPY2 .M A5:A4,A9:A8,A3:A2:A1:A0
A3 == 0xe86a3050
A2 == 0xe0f8800b
A1 == 0x40000000
A0 == 0x40000000
4.101 DMPYSP
2-Way SIMD Multiply, Packed Single Precision Floating Point
31 30 29 28 27 23 22 18 17 13 12 11 7 6 2 1 0
5 5 5 5 5
Delay Slots 3
FMCR= 0x00000002
FMCR== 0x00000000
A5 == 0xbf800000
A4 == 0x3f800000
A9 == 0x7fc00000
A8 == 0x3f800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0xffffffff
A0 == 0x3f800000
FMCR= 0x00000002
FMCR== 0x00000000
A5 == 0xbf800000
A4 == 0x3f800000
A9 == 0x7fc00000
A8 == 0x7fc00000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0xffffffff
A0 == 0x7fffffff
FMCR= 0x00000002
FMCR== 0x00000000
A5 == 0x3f800000
A4 == 0xffc00000
A9 == 0x3f800000
A8 == 0xbf800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x3f800000
A0 == 0x7fffffff
FMCR= 0x00000001
FMCR== 0x00000000
A5 == 0xffc00000
A4 == 0x3f800000
A9 == 0xbf800000
A8 == 0x3f800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0x3f800000
FMCR= 0x00000001
FMCR== 0x00000000
A5 == 0xffc00000
A4 == 0xffc00000
A9 == 0xbf800000
A8 == 0x3f800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0xffffffff
FMCR= 0x00000001
FMCR== 0x00000000
A5 == 0xbf800000
A4 == 0x3f800000
A9 == 0xbf800000
A8 == 0x3f800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x3f800000
A0 == 0x3f800000
FMCR= 0x00000000
FMCR== 0x00000000
A5 == 0x7f900000
A4 == 0x3f800000
A9 == 0x7f900000
A8 == 0x3f800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0x3f800000
FMCR= 0x00000013
FMCR== 0x00000000
A5 == 0x3f800000
A4 == 0x7f900000
A9 == 0x7f900000
A8 == 0x3f800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0x7fffffff
FMCR= 0x00000013
FMCR== 0x00000000
A5 == 0x7f900000
A4 == 0x3f800000
A9 == 0x3f800000
A8 == 0x7f900000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0x7fffffff
FMCR= 0x00000013
FMCR== 0x00000000
A5 == 0x7f900000
A4 == 0x7f900000
A9 == 0x7f900000
A8 == 0xff900000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0xffffffff
FMCR= 0x00000013
FMCR== 0x00000000
A5 == 0x3356bf94
A4 == 0x3f800000
A9 == 0x43ff8000
A8 == 0x3f800000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x37d65434
A0 == 0x3f800000
FMCR= 0x00000080
FMCR== 0x00000000
A5 == 0x3356bf94
A4 == 0x43ff8000
A9 == 0x43ff8000
A8 == 0x3356bf94
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x37d65434
A0 == 0x37d65434
FMCR= 0x00000080
FMCR== 0x00000000
A5 == 0x3f800000
A4 == 0x3356bf94
A9 == 0x3f800000
A8 == 0x43ff8000
DMPYSP .M A5:A4,A9:A8,A1:A0
A1 == 0x3f800000
A0 == 0x37d65434
FMCR= 0x00000080
FMCR== 0x00000000
B5 == 0x3356bf94
B4 == 0x3f800000
B9 == 0x43ff8000
B8 == 0x3f800000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0x37d65434
B0 == 0x3f800000
FMCR= 0x00800000
FMCR== 0x00000000
B5 == 0x3356bf94
B4 == 0x43ff8000
B9 == 0x43ff8000
B8 == 0x3356bf94
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0x37d65434
B0 == 0x37d65434
FMCR= 0x00800000
FMCR== 0x00000000
B5 == 0x3f800000
B4 == 0x3356bf94
B9 == 0x3f800000
B8 == 0x43ff8000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0x3f800000
B0 == 0x37d65434
FMCR= 0x00800000
FMCR== 0x00000000
B5 == 0x7fc00000
B4 == 0x7f900000
B9 == 0xbf800000
B8 == 0x3f800000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0xffffffff
B0 == 0x7fffffff
FMCR= 0x00110000
FMCR== 0x00000000
B5 == 0xbf800000
B4 == 0x3f800000
B9 == 0x7fc00000
B8 == 0x7f900000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0xffffffff
B0 == 0x7fffffff
FMCR= 0x00120000
FMCR== 0x00000000
B5 == 0x7fc00000
B4 == 0x7f900000
B9 == 0xff900000
B8 == 0x3f800000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0xffffffff
B0 == 0x7fffffff
FMCR= 0x00130000
FMCR== 0x00000000
B5 == 0xff900000
B4 == 0x3f800000
B9 == 0x7fc00000
B8 == 0x7f900000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0xffffffff
B0 == 0x7fffffff
FMCR= 0x00130000
FMCR== 0x00000000
B5 == 0x7fc00000
B4 == 0x3f800000
B9 == 0xbf800000
B8 == 0x7f900000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0xffffffff
B0 == 0x7fffffff
FMCR= 0x00130000
FMCR== 0x00000000
B5 == 0x3f800000
B4 == 0x7fc00000
B9 == 0x7f900000
B8 == 0xbf800000
DMPYSP .M B5:B4,B9:B8,B1:B0
B1 == 0x7fffffff
B0 == 0xffffffff
FMCR= 0x00130000
4.102 DMPYSU4
4-Way SIMD Multiply Signed By Unsigned, Packed 8-bit
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
3 5 5 5 5
Description For each 8-bit quantity in dwop1 and xdwop2, DMPYSU4 performs a signed 8-bit by
unsigned 8-bit multiply between the value from dwop1 and xdwop2, producing a
signed 16-bit result. The eight signed 16-bit results are packed into a 128-bit register
quad.
Result layout (two MPYSU4 operations side by side; eight 16-bit lanes of the register quad):
a_7 * b_7 | a_6 * b_6 | a_5 * b_5 | a_4 * b_4 | a_3 * b_3 | a_2 * b_2 | a_1 * b_1 | a_0 * b_0
Delay Slots 3
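The eight signed-by-unsigned multiplies can be sketched in C as follows. This is a host-side model, not TI code. It assumes, following the order in the description, that the bytes of dwop1 are the signed operands and the bytes of xdwop2 are the unsigned operands; the function names and the array representation of the register quad (dst[0] is the least significant 16-bit lane) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DMPYSU4: eight signed 8-bit by unsigned 8-bit
 * multiplies, each producing a signed 16-bit result stored as its
 * 16-bit two's-complement pattern. */
void dmpysu4_model(uint64_t dwop1, uint64_t xdwop2, uint16_t dst[8])
{
    for (int lane = 0; lane < 8; lane++) {
        int32_t a = (int8_t)(uint8_t)(dwop1 >> (8 * lane));  /* signed   */
        int32_t b = (uint8_t)(xdwop2 >> (8 * lane));         /* unsigned */
        dst[lane] = (uint16_t)(a * b);  /* range -32640..32385 fits 16 bits */
    }
}

void dmpysu4_selftest(void)
{
    uint16_t d[8];
    dmpysu4_model(0x00000000000080FFULL, 0x000000000000FFFFULL, d);
    assert(d[0] == 0xFF01u);   /*   -1 * 255 =   -255 */
    assert(d[1] == 0x8080u);   /* -128 * 255 = -32640 */
    assert(d[2] == 0x0000u);
}
```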
Example
4.103 DMPYU2
4-Way SIMD Multiply Unsigned by Unsigned, Packed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 7 6 2 1 0
5 5 5 5 5
Description The DMPYU2 instruction performs four 16-bit multiplications between unsigned
packed 16-bit quantities. The values in dwop1 and xdwop2 are treated as unsigned
packed 16-bit quantities. The 32-bit results are placed in a 128-bit register quad.
A5 == 0x7fff7fff
A4 == 0x80018001
A9 == 0x7fff7fff
A8 == 0x80018001
DMPYU2 .M A5:A4,A9:A8,A3:A2:A1:A0
A3 == 0x3fff0001
A2 == 0x3fff0001
A1 == 0x40010001
A0 == 0x40010001
A5 == 0x7fff8001
A4 == 0x3ccdc333
A9 == 0x80017fff
A8 == 0xc333c333
DMPYU2 .M A5:A4,A9:A8,A3:A2:A1:A0
A3 == 0x3fffffff
A2 == 0x3fffffff
A1 == 0x2e5c43d7
A0 == 0x94d6bc29
A5 == 0x87654321
A4 == 0x80008000
A9 == 0x321089ab
A8 == 0x80008000
DMPYU2 .M A5:A4,A9:A8,A3:A2:A1:A0
A3 == 0x1a7a3050
A2 == 0x2419800b
A1 == 0x40000000
A0 == 0x40000000
4.104 DMPYU4
4-Way SIMD Multiply Unsigned By Unsigned, Packed 8-bit
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
3 5 5 5 5
Description DMPYU4 performs eight unsigned 8-bit by unsigned 8-bit multiplies between
corresponding 8-bit quantities in dwop1 and xdwop2, producing eight unsigned 16-bit
results packed into a 128-bit register quad.
Result layout (two MPYU4 operations side by side; eight 16-bit lanes of the register quad):
a_7 * b_7 | a_6 * b_6 | a_5 * b_5 | a_4 * b_4 | a_3 * b_3 | a_2 * b_2 | a_1 * b_1 | a_0 * b_0
Delay Slots 3
Example
4.105 DMV
Move Two Independent Registers to a Register Pair
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 2 1 0
3 5 5 5 2 4 4
31 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
Description This instruction moves two registers into a register pair. It performs two moves at
once and is useful when performing large amounts of double-word processing.
Execution if(cond){
0 + src2_e -> dst_e
0 + src2_o -> dst_o
}
else nop
Delay Slots 0
Functional Unit Latency 1
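A minimal C sketch of the move is shown below. It is a host-side model, not TI code, and the function names are hypothetical. The operand placement (first source to the odd register, second source to the even register) is inferred from the example in this section, where DMV .L A0,A1,A3:A2 leaves the A0 value in A3 and the A1 value in A2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DMV: the first source lands in the odd (upper)
 * register of the pair and the second source in the even (lower) one. */
uint64_t dmv_model(uint32_t src1, uint32_t src2)
{
    return ((uint64_t)src1 << 32) | src2;
}

/* Values from the example in this section. */
void dmv_selftest(void)
{
    assert(dmv_model(0x87654321u, 0x12345678u) == 0x8765432112345678ULL);
}
```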
See Also
Example A0 == 0x87654321
A1 == 0x12345678
DMV .L A0,A1,A3:A2
A3 == 0x87654321
A2 == 0x12345678
4.106 DMVD
Move Two Independent Registers to a Register Pair, Delayed
31 30 29 28 27 23 22 18 17 13 12 11 10 9 6 5 2 1 0
5 5 5 2 4 4
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
0 0 0 1 dst src2 src1 x opfield 110 s p
5 5 5 7 3
Description This instruction moves two registers into a register pair. This performs 2 moves at once
and is useful when performing large amounts of double word processing. The
writeback is delayed by 4 clocks to reduce register pressure.
Execution if(cond){
0 + src2_e -> dst_e
0 + src2_o -> dst_o
}
else nop
Delay Slots 3
See Also
Example A1 == 0x87654321
A6 == 0x12345678
DMVD .L A1,A6,A3:A2
A3 == 0x87654321
A2 == 0x12345678
4.107 DOTP2
Dot Product, Signed, Packed 16-Bit
or
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 6 5 4 3 2 1 0
src1 x 0 op 1 1 0 0 s p
5 1 5 1 1
Description Returns the dot-product between two pairs of signed, packed 16-bit values. The values
in src1 and src2 are treated as signed, packed 16-bit quantities. The signed result is
written either to a single 32-bit register, or sign-extended into a 64-bit register pair.
The product of the lower halfwords of src1 and src2 is added to the product of the upper
halfwords of src1 and src2. The result is then written to the dst.
If the result is sign-extended into a 64-bit register pair, the upper word of the register
pair always contains either all 0s or all 1s, depending on whether the result is positive
or negative, respectively.
31      16 15       0
  a_hi       a_lo        ← src1
        DOTP2
  b_hi       b_lo        ← src2
          =
63      32 31       0
  0 or F     a_hi × b_hi + a_lo × b_lo   ← dst_o:dst_e
The 32-bit result version returns the same results that the 64-bit result version does in
the lower 32 bits. The upper 32-bits are discarded.
31      16 15       0
  a_hi       a_lo        ← src1
        DOTP2
  b_hi       b_lo        ← src2
          =
31                  0
  a_hi × b_hi + a_lo × b_lo              ← dst
Note—In the overflow case, where all four halfwords in src1 and src2 are 8000h,
the value 8000 0000h is written into the 32-bit dst and 0000 0000 8000 0000h is
written into the 64-bit dst.
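The two result forms and the overflow case in the note can be checked with a short C sketch. This is a host-side model, not TI code, and the function names are hypothetical. The 64-bit form returns the sign-extended sum; the 32-bit form keeps only its lower 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 64-bit DOTP2 form: a_hi*b_hi + a_lo*b_lo,
 * computed without loss and sign-extended to 64 bits.
 * Assumes two's-complement narrowing when reinterpreting halfwords. */
int64_t dotp2_64_model(uint32_t src1, uint32_t src2)
{
    int64_t a_hi = (int16_t)(uint16_t)(src1 >> 16);
    int64_t a_lo = (int16_t)(uint16_t)src1;
    int64_t b_hi = (int16_t)(uint16_t)(src2 >> 16);
    int64_t b_lo = (int16_t)(uint16_t)src2;
    return a_hi * b_hi + a_lo * b_lo;
}

/* Hypothetical model of the 32-bit form: the lower 32 bits of the sum. */
uint32_t dotp2_32_model(uint32_t src1, uint32_t src2)
{
    return (uint32_t)(uint64_t)dotp2_64_model(src1, src2);
}

/* Overflow case from the note: all four halfwords are 8000h. */
void dotp2_selftest(void)
{
    assert((uint64_t)dotp2_64_model(0x80008000u, 0x80008000u)
           == 0x0000000080000000ULL);
    assert(dotp2_32_model(0x80008000u, 0x80008000u) == 0x80000000u);
}
```

In the overflow case the true sum is +2^31, which is positive in the 64-bit form but has its sign bit set when only 32 bits are kept.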
Execution if (cond)(lsb16(src1) × lsb16(src2)) + (msb16(src1) × msb16(src2)) → dst
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
Example 2
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 FFFF FFFFh E6DF F6D4h
-421,529,900
Example 3
Example 4
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 0000 0000h 12FC 544Dh
318,526,541
4.108 DOTP4H
Dot Product, Signed by Signed, Packed 16-bit
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
3 5 5 5 5
Description The DOTP4H instruction returns the dot-product between two sets of four packed
16-bit values.
The values in dwop1 and xdwop2 are treated as signed packed 16-bit quantities.
For each pair of 16-bit quantities in dwop1 and xdwop2, the signed 16-bit value from dwop1
is multiplied with the signed 16-bit value from xdwop2. The four products are summed
together, and the resulting dot product is written either as a 32-bit result or as a 64-bit
signed result.
For the 32-bit destination form, the result is saturated to 32 bits, and the SAT bits in
CSR and SSR are set when saturation occurs.
a_3 * b_3 + a_2 * b_2 + a_1 * b_1 + a_0 * b_0 ← dst
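Both destination forms can be sketched in C as follows. This is a host-side model, not TI code: the function names are hypothetical, each source pair is passed as one 64-bit word, and the CSR/SSR saturation status is reduced to a single flag. The self-test values come from the two examples in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 64-bit DOTP4H form: sum of four signed
 * 16x16 products. Assumes two's-complement narrowing of each lane. */
int64_t dotp4h_64_model(uint64_t dwop1, uint64_t xdwop2)
{
    int64_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int64_t a = (int16_t)(uint16_t)(dwop1  >> (16 * lane));
        int64_t b = (int16_t)(uint16_t)(xdwop2 >> (16 * lane));
        sum += a * b;
    }
    return sum;
}

/* Hypothetical model of the 32-bit form: the sum is clamped to the
 * int32_t range and *saturated reports whether clamping happened. */
int32_t dotp4h_sat32_model(uint64_t dwop1, uint64_t xdwop2, int *saturated)
{
    int64_t sum = dotp4h_64_model(dwop1, xdwop2);
    *saturated = (sum > INT32_MAX) || (sum < INT32_MIN);
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

void dotp4h_selftest(void)
{
    int sat;
    /* A3:A2 = 0x321089AB (both), A1:A0 = 0x87654321 (both) */
    assert((uint64_t)dotp4h_64_model(0x321089AB321089ABULL,
                                     0x8765432187654321ULL)
           == 0xFFFFFFFF92C560B6ULL);
    /* Maximum positive inputs saturate the 32-bit form. */
    assert(dotp4h_sat32_model(0x7FFF7FFF7FFF7FFFULL,
                              0x7FFF7FFF7FFF7FFFULL, &sat) == INT32_MAX);
    assert(sat == 1);
}
```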
Execution TBD
Instruction Type 4-cycle
Delay Slots 3
Example A3 == 0x321089AB
A2 == 0x321089AB
A1 == 0x87654321
A0 == 0x87654321
DOTP4H .M A3:A2,A1:A0,A14:A13
A14 <== 0xFFFFFFFF
A13 <== 0x92C560B6
CSR == 0x10010000
A3 == 0x7FFF7FFF
A2 == 0x7FFF7FFF
A1 == 0x7FFF7FFF
A0 == 0x7FFF7FFF
DOTP4H .M A3:A2,A1:A0,A14 ; Maximum positive inputs.
A14 <== 0x7FFFFFFF
CSR<== 0x10010200
4.109 DOTPN2
Dot Product With Negate, Signed, Packed 16-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 0 0 1 1 1 0 0 s p
5 1 1 1
Description Returns the dot-product between two pairs of signed, packed 16-bit values where the
second product is negated. The values in src1 and src2 are treated as signed, packed
16-bit quantities. The signed result is written to a single 32-bit register.
The product of the lower halfwords of src1 and src2 is subtracted from the product of
the upper halfwords of src1 and src2. The result is then written to dst.
31      16 15       0
  a_hi       a_lo        ← src1
        DOTPN2
  b_hi       b_lo        ← src2
          =
31                  0
  a_hi × b_hi - a_lo × b_lo              ← dst
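The subtraction of the lower product from the upper product can be sketched in C as follows. This is a host-side model, not TI code, and the function names are hypothetical. The self-test includes the largest result the operation can produce, (-32768 × -32768) - (32767 × -32768) = 2147450880, which still fits in a signed 32-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DOTPN2: a_hi*b_hi - a_lo*b_lo on signed halfwords.
 * Each product has magnitude at most 2^30 and the difference stays inside
 * the int32_t range, so plain 32-bit arithmetic is sufficient.
 * Assumes two's-complement narrowing when reinterpreting halfwords. */
int32_t dotpn2_model(uint32_t src1, uint32_t src2)
{
    int32_t a_hi = (int16_t)(uint16_t)(src1 >> 16);
    int32_t a_lo = (int16_t)(uint16_t)src1;
    int32_t b_hi = (int16_t)(uint16_t)(src2 >> 16);
    int32_t b_lo = (int16_t)(uint16_t)src2;
    return a_hi * b_hi - a_lo * b_lo;
}

void dotpn2_selftest(void)
{
    assert(dotpn2_model(0x00020003u, 0x00040005u) == -7);  /* 2*4 - 3*5 */
    assert(dotpn2_model(0x80007FFFu, 0x80008000u) == 2147450880);
}
```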
Note: Unlike DOTP2, no overflow case exists for this instruction.
Execution if (cond) (msb16(src1) × msb16(src2)) - (lsb16(src1) × lsb16(src2)) → dst
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
Example 2
4.110 DOTPNRSU2
Dot Product With Negate, Shift and Round, Signed by Unsigned, Packed 16-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 1 1 1 0 0 s p
5 1 1 1
Description Returns the dot-product between two pairs of packed 16-bit values, where the second
product is negated. This instruction takes the result of the dot-product and performs
an additional round and shift step. The values in src1 are treated as signed, packed
16-bit quantities; whereas, the values in src2 are treated as unsigned, packed 16-bit
quantities. The results are written to dst.
The product of the lower halfwords of src1 and src2 is subtracted from the product of
the upper halfwords of src1 and src2. The value 2^15 is then added to this sum, producing
an intermediate 33-bit result. The intermediate result is signed shifted right by 16,
producing a rounded, shifted result that is sign extended and placed in dst.
31      16 15       0
  sa_hi      sa_lo       ← src1
      DOTPNRSU2
  ub_hi      ub_lo       ← src2
          =
31                  0
  (((sa_hi × ub_hi) - (sa_lo × ub_lo)) + 8000h) >> 16   ← dst
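The round-and-shift step can be sketched in C as follows. This is a host-side model, not TI code, and the function names are hypothetical. The intermediate value is held in 64 bits because it can need 33 bits, and the arithmetic right shift is written as a floor division so that the sketch does not depend on how the host compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DOTPNRSU2:
 *   (((sa_hi * ub_hi) - (sa_lo * ub_lo)) + 8000h) >> 16
 * src1 halfwords are signed, src2 halfwords are unsigned.
 * Assumes two's-complement narrowing when reinterpreting halfwords. */
int32_t dotpnrsu2_model(uint32_t src1, uint32_t src2)
{
    int64_t sa_hi = (int16_t)(uint16_t)(src1 >> 16);
    int64_t sa_lo = (int16_t)(uint16_t)src1;
    int64_t ub_hi = (uint16_t)(src2 >> 16);
    int64_t ub_lo = (uint16_t)src2;
    int64_t t = sa_hi * ub_hi - sa_lo * ub_lo + 0x8000;

    /* Floor division by 2^16, i.e. a sign-preserving shift right by 16. */
    int64_t q = (t >= 0) ? (t >> 16) : -((-t + 0xFFFF) >> 16);
    return (int32_t)q;
}

void dotpnrsu2_selftest(void)
{
    /* 1*32768 - 0 + 32768 = 65536, shifted right by 16 gives 1. */
    assert(dotpnrsu2_model(0x00010000u, 0x80000000u) == 1);
    /* 0 - 1*65535 + 32768 = -32767, which floors to -1. */
    assert(dotpnrsu2_model(0x00000001u, 0x0000FFFFu) == -1);
    /* 32767*65535 + 32768*65535 + 32768 needs 33 bits before the shift. */
    assert(dotpnrsu2_model(0x7FFF8000u, 0xFFFFFFFFu) == 65534);
}
```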
Execution if (cond) {
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
Example 2
Example 3
4.111 DOTPNRUS2
Dot Product With Negate, Shift and Round, Unsigned by Signed, Packed 16-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 1 1 1 0 0 s p
5 1 1 1
Description The DOTPNRUS2 pseudo-operation performs the dot-product between two pairs of
packed 16-bit values, where the second product is negated. This instruction takes the
result of the dot-product and performs an additional round and shift step. The values
in src1 are treated as signed, packed 16-bit quantities; whereas, the values in src2 are
treated as unsigned, packed 16-bit quantities. The results are written to dst. The
assembler uses the DOTPNRSU2 src1, src2, dst instruction to perform this task.
The product of the lower halfwords of src1 and src2 is subtracted from the product of
the upper halfwords of src1 and src2. The value 2^15 is then added to this sum, producing
an intermediate 32 or 33-bit result. The intermediate result is signed shifted right by 16,
producing a rounded, shifted result that is sign extended and placed in dst.
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.112 DOTPRSU2
Dot Product With Shift and Round, Signed by Unsigned, Packed 16-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 0 1 1 1 0 0 s p
5 1 1 1
Description Returns the dot-product between two pairs of packed 16-bit values. This instruction
takes the result of the dot-product and performs an additional round and shift step. The
values in src1 are treated as signed packed 16-bit quantities; whereas, the values in src2
are treated as unsigned packed 16-bit quantities. The results are written to dst.
The product of the lower halfwords of src1 and src2 is added to the product of the upper
halfwords of src1 and src2. The value 2^15 is then added to this sum, producing an
intermediate 32 or 33-bit result. The intermediate result is signed shifted right by 16,
producing a rounded, shifted result that is sign extended and placed in dst.
31      16 15       0
  sa_hi      sa_lo       ← src1
      DOTPRSU2
  ub_hi      ub_lo       ← src2
          =
31                  0
  (((sa_hi × ub_hi) + (sa_lo × ub_lo)) + 8000h) >> 16   ← dst
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
Example 2
Example 3
4.113 DOTPRUS2
Dot Product With Shift and Round, Unsigned by Signed, Packed 16-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 0 1 1 1 0 0 s p
5 1 1 1
Description The DOTPRUS2 pseudo-operation returns the dot-product between two pairs of
packed 16-bit values. This instruction takes the result of the dot-product, and performs
an additional round and shift step. The values in src1 are treated as signed packed 16-bit
quantities; whereas, the values in src2 are treated as unsigned packed 16-bit quantities.
The results are written to dst. The assembler uses the DOTPRSU2 (.unit) src1, src2, dst
instruction to perform this task.
The product of the lower halfwords of src1 and src2 is added to the product of the upper
halfwords of src1 and src2. The value 2^15 is then added to this sum, producing an
intermediate 32-bit result. The intermediate result is signed shifted right by 16,
producing a rounded, shifted result that is sign extended and placed in dst.
if (cond) {
}
else nop
if (cond) {
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.114 DOTPSU4
Dot Product, Signed by Unsigned, Packed 8-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 1 0 1 1 0 0 s p
5 1 1 1
Description Returns the dot-product between four sets of packed 8-bit values. The values in src1 are
treated as signed packed 8-bit quantities; whereas, the values in src2 are treated as
unsigned 8-bit packed data. The signed result is written into dst.
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2. The four products are summed
together, and the resulting dot product is written as a signed 32-bit result to dst.
31   24 23   16 15    8 7     0
  sa_3    sa_2    sa_1    sa_0     ← src1
           DOTPSU4
  ub_3    ub_2    ub_1    ub_0     ← src2
              =
31                            0
  (sa_3 × ub_3) + (sa_2 × ub_2) + (sa_1 × ub_1) + (sa_0 × ub_0)   ← dst
Execution if (cond) {
(sbyte0(src1) × ubyte0(src2)) +
(sbyte1(src1) × ubyte1(src2)) +
(sbyte2(src1) × ubyte2(src2)) +
(sbyte3(src1) × ubyte3(src2)) → dst
}
else nop
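The execution above maps directly onto a short C loop. The following is a host-side sketch, not TI code, and the function names are hypothetical. The sum of four signed-by-unsigned byte products lies between -130560 and 129540, so it always fits in a signed 32-bit result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of DOTPSU4: sum over four byte lanes of
 * sbyteN(src1) * ubyteN(src2).
 * Assumes two's-complement narrowing when reinterpreting a byte as int8_t. */
int32_t dotpsu4_model(uint32_t src1, uint32_t src2)
{
    int32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int32_t sa = (int8_t)(uint8_t)(src1 >> (8 * lane));  /* signed   */
        int32_t ub = (uint8_t)(src2 >> (8 * lane));          /* unsigned */
        sum += sa * ub;
    }
    return sum;
}

void dotpsu4_selftest(void)
{
    assert(dotpsu4_model(0x01020304u, 0x01010101u) == 10);
    assert(dotpsu4_model(0xFFFFFFFFu, 0xFFFFFFFFu) == -1020);  /* 4 * (-1 * 255) */
}
```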
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
Example 2
4.115 DOTPSU4H
Dot Product, Signed by Unsigned, Packed 16-bit
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
3 5 5 5 5
Description The DOTPSU4H instruction returns the dot-product between four sets of packed
16-bit values. This is essentially a multiply and add operation. The values in dwop1 are
treated as signed packed 16-bit quantities, whereas the values in xdwop2 are treated as
unsigned packed 16-bit quantities.
For each pair of 16-bit quantities in op1 and xop2, the signed 16-bit value from op1 is
multiplied with the unsigned 16-bit value from xop2. The four products are summed
together, and the resulting dot product is written to either a 32-bit result or to a 64-bit
signed result.
For the 32-bit destination form, the result is saturated to 32 bits, and the SAT bits in
CSR and SSR are set when saturation occurs.
63 48 47 32 31 16 15 0
a_3 a_2 a_1 a_0 ←op1
dotpsu4
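Both destination forms can be sketched in C as follows. This is a host-side model, not TI code: the function names are hypothetical, each source pair is passed as one 64-bit word, and the CSR/SSR saturation status is reduced to a single flag. The self-test values come from the register listings later in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 64-bit DOTPSU4H form: sum of four products of
 * a signed 16-bit lane (dwop1) and an unsigned 16-bit lane (xdwop2).
 * Assumes two's-complement narrowing when reinterpreting a lane as int16_t. */
int64_t dotpsu4h_64_model(uint64_t dwop1, uint64_t xdwop2)
{
    int64_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int64_t a = (int16_t)(uint16_t)(dwop1 >> (16 * lane));  /* signed   */
        int64_t b = (uint16_t)(xdwop2 >> (16 * lane));          /* unsigned */
        sum += a * b;
    }
    return sum;
}

/* Hypothetical model of the 32-bit form: the sum is clamped to the
 * int32_t range and *saturated reports whether clamping happened. */
int32_t dotpsu4h_sat32_model(uint64_t dwop1, uint64_t xdwop2, int *saturated)
{
    int64_t sum = dotpsu4h_64_model(dwop1, xdwop2);
    *saturated = (sum > INT32_MAX) || (sum < INT32_MIN);
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

void dotpsu4h_selftest(void)
{
    int sat;
    /* 4 * (-1 * 65535) = -262140 = 0xFFFFFFFF FFFC0004 */
    assert((uint64_t)dotpsu4h_64_model(0xFFFFFFFFFFFFFFFFULL,
                                       0xFFFFFFFFFFFFFFFFULL)
           == 0xFFFFFFFFFFFC0004ULL);
    /* 4 * (0x7f7f * 0xffff) = 0x1FDFA0204, which saturates to 0x7FFFFFFF. */
    assert(dotpsu4h_sat32_model(0x7F7F7F7F7F7F7F7FULL,
                                0xFFFFFFFFFFFFFFFFULL, &sat) == INT32_MAX);
    assert(sat == 1);
}
```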
Execution TBD
See Also
CSR= 0x10010000
CSR == 0x00000000
B1 == 0xffffffff
B0 == 0xffffffff
B3 == 0xffffffff
B2 == 0xffffffff
DOTPSU4H .M B1:B0,B3:B2,B11:B10
B11 == 0xffffffff
B10 == 0xfffc0004
CSR= 0x10010000
CSR == 0x00000000
CSR == 0x00000000
A1 == 0x7f7f7f7f
A0 == 0x7f7f7f7f
A3 == 0xffffffff
A2 == 0xffffffff
DOTPSU4H .M A1:A0,A3:A2,A11
A11 == 0x7fffffff
CSR= 0x10010200 ; 1fdfa0204 -> 7fffffff
CSR == 0x00000000
4.116 DOTPUS4
Dot Product, Unsigned by Signed, Packed 8-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 1 0 1 1 0 0 s p
5 1 1 1
Description The DOTPUS4 pseudo-operation returns the dot-product between four sets of packed
8-bit values. The values in src1 are treated as signed packed 8-bit quantities; whereas,
the values in src2 are treated as unsigned 8-bit packed data. The signed result is written
into dst. The assembler uses the DOTPSU4 (.unit) src1, src2, dst instruction to perform
this task.
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2. The four products are summed
together, and the resulting dot-product is written as a signed 32-bit result to dst.
Execution if (cond) {
(ubyte0(src2) × sbyte0(src1)) +
(ubyte1(src2) × sbyte1(src1)) +
(ubyte2(src2) × sbyte2(src1)) +
(ubyte3(src2) × sbyte3(src1)) → dst
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
See Also DOTPU4, DOTPSU4
4.117 DOTPU4
Dot Product, Unsigned, Packed 8-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 0 1 1 0 0 s p
5 1 1 1
Description Returns the dot-product between four sets of packed 8-bit values. The values in both
src1 and src2 are treated as unsigned, 8-bit packed data. The unsigned result is written
into dst.
For each pair of 8-bit quantities in src1 and src2, the unsigned 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2. The four products are summed
together, and the resulting dot-product is written as a 32-bit result to dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
DOTPU4
ub_3 ub_2 ub_1 ub_0 ← src2
31 0
(ua_3 × ub_3) + (ua_2 × ub_2) + (ua_1 × ub_1) + (ua_0 × ub_0) ← dst
Execution if (cond) {
(ubyte0(src1) × ubyte0(src2)) +
(ubyte1(src1) × ubyte1(src2)) +
(ubyte2(src1) × ubyte2(src2)) +
(ubyte3(src1) × ubyte3(src2)) → dst
}
else nop
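As a hedged illustration of the Execution expression above, the following Python sketch computes the unsigned dot product of the four byte lanes. The function name `dotpu4` is an assumption of this sketch; conditional execution and delay slots are not modeled.

```python
def dotpu4(src1, src2):
    """Sum of ubyte_n(src1) * ubyte_n(src2) for n = 0..3 (unsigned result)."""
    total = 0
    for n in range(4):
        a = (src1 >> (8 * n)) & 0xFF  # unsigned byte n of src1
        b = (src2 >> (8 * n)) & 0xFF  # unsigned byte n of src2
        total += a * b
    return total
```

The largest possible result is 4 × 255 × 255 = 260100, so the sum always fits in the 32-bit dst without overflow.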
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.118 DPACK2
Parallel PACK2 and PACKH2 Operations
Compatibility
Opcode
31 30 29 28 27 24 23 22 18
0 0 0 1 dst 0 src2
4 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 1 0 0 1 1 0 s p
5 1 1 1
The PACK2 function of the DPACK2 instruction takes the lower halfword from src1
and the lower halfword from src2, and packs them both into dst_e. The lower halfword
of src1 is placed in the upper halfword of dst_e. The lower halfword of src2 is placed in
the lower halfword of dst_e.
The PACKH2 function of the DPACK2 instruction takes the upper halfword from src1
and the upper halfword from src2, and packs them both into dst_o. The upper halfword
of src1 is placed in the upper halfword of dst_o. The upper halfword of src2 is placed in
the lower halfword of dst_o.
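The two packing functions described above can be sketched as follows. This is a minimal model, assuming 32-bit src1 and src2 and returning the destination pair as a (dst_o, dst_e) tuple; the name `dpack2` and the tuple convention are choices of this sketch only.

```python
def dpack2(src1, src2):
    """Model of DPACK2: returns (dst_o, dst_e) as 32-bit unsigned values."""
    lo1, hi1 = src1 & 0xFFFF, (src1 >> 16) & 0xFFFF
    lo2, hi2 = src2 & 0xFFFF, (src2 >> 16) & 0xFFFF
    dst_e = (lo1 << 16) | lo2  # PACK2: lower halfwords of src1 and src2
    dst_o = (hi1 << 16) | hi2  # PACKH2: upper halfwords of src1 and src2
    return dst_o, dst_e
```

With src1 = 87654321h and src2 = 12345678h, the model gives dst_o = 87651234h and dst_e = 43215678h.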
Delay Slots 0
Before instruction: A0 87654321h, A1 12345678h
After instruction: A2 43215678h, A3 87651234h
4.119 DPACKH2
2-Way SIMD Pack 16 MSBs Into Upper and Lower Register Halves
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
31 30 29 28 27 23 22 18 17 13 12 11 6 5 2 1 0
0 0 0 1 dst src2 src1 x opfield 1000 s p
5 5 5 6 4
Description The DPACKH2 instruction takes the high half-words from each of the words in dwop1
and xdwop2 and packs them both into dwdst. The upper half-word of each word in
dwop1 is placed in the upper half-word of each word in dwdst. The upper half-word of
each word in xdwop2 is placed in the lower half-word of each word in dwdst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2.
63 48 47 32 31 16 15 0
v v v v
QRSTUVWXYZABCDEF qrstuvwxyzabcdef ABCDEFGHIJKLMNOP abcdefghijklmnop ← dst
Execution msb16(src2_e)->lsb16(dst_e)
msb16(src1_e)->msb16(dst_e)
msb16(src2_o)->lsb16(dst_o)
msb16(src1_o)->msb16(dst_o)
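The four Execution assignments above can be modeled with a short Python sketch. Register pairs are represented here as (odd, even) tuples of 32-bit values, which is a convention of this sketch; the name `dpackh2` is likewise illustrative.

```python
def dpackh2(src1_pair, src2_pair):
    """Model of DPACKH2 on (odd, even) register pairs of 32-bit words."""
    def packh2(w1, w2):
        # msb16(w1) -> msb16(dst), msb16(w2) -> lsb16(dst)
        return (w1 & 0xFFFF0000) | ((w2 >> 16) & 0xFFFF)

    s1_o, s1_e = src1_pair
    s2_o, s2_e = src2_pair
    return packh2(s1_o, s2_o), packh2(s1_e, s2_e)
```

Applying the model to src1 = (0xabcd1111, 0xabcdabcd) and src2 = (0xbabe2222, 0xc0def00d) yields (0xabcdbabe, 0xabcdc0de).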
Instruction Type Single-cycle
Delay Slots 0
Example A1 == 0xabcd1111
A0 == 0xabcdabcd
A3 == 0xbabe2222
A2 == 0xc0def00d
DPACKH2 .L A1:A0,A3:A2,A5:A4
A5 == 0xabcdbabe
A4 == 0xabcdc0de
4.120 DPACKH4
2-Way SIMD Pack Four High Bytes Into Four 8-Bit Halfwords
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
Description The DPACKH4 instruction behaves precisely as two PACKH4 instructions executed in
parallel. For example, the instruction DPACKH4 A7:A6,A5:A4,A3:A2 is equivalent to:
PACKH4 A7,A5,A3
|| PACKH4 A6,A4,A2
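The equivalence with two parallel PACKH4 operations can be sketched in Python. The sketch assumes the usual PACKH4 behavior (the high byte of each halfword of src1 goes to the upper half of dst, and the high byte of each halfword of src2 goes to the lower half); the function names and the (odd, even) tuple convention are assumptions of this sketch.

```python
def packh4(src1, src2):
    """High byte of each halfword: src1 -> upper half of dst, src2 -> lower half."""
    def high_bytes(w):
        return (((w >> 24) & 0xFF) << 8) | ((w >> 8) & 0xFF)

    return (high_bytes(src1) << 16) | high_bytes(src2)


def dpackh4(src1_pair, src2_pair):
    """Model of DPACKH4 on (odd, even) register pairs: two PACKH4 in parallel."""
    s1_o, s1_e = src1_pair
    s2_o, s2_e = src2_pair
    return packh4(s1_o, s2_o), packh4(s1_e, s2_e)
```

For the inputs A5:A4 = 0xbe55ef55:0x44556677 and A3:A2 = 0xde55ad55:0x00112233, the model returns 0xbeefdead:0x44660022.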
Delay Slots 0
Example A5 == 0xbe55ef55
A4 == 0x44556677
A3 == 0xde55ad55
A2 == 0x00112233
DPACKH4 .L A5:A4,A3:A2,A1:A0
A1 == 0xbeefdead
A0 == 0x44660022
4.121 DPACKHL2
2-Way SIMD Pack 16 MSB Into Upper and 16 LSB Into Lower Register Halves
31 30 29 28 27 23 22 18 17 13 12 11 6 5 2 1 0
5 5 5 6 4
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
0 0 0 1 dst src2 src1 x opcode 110 s p
5 5 5 7 3
Description The DPACKHL2 instruction takes the high half-words from the words in dwop1 and
low half-words from xdwop2 and packs them both into the words in dwdst. The upper
half-word of each word in dwop1 is placed in the upper half-word of each word in
dwdst. The lower half-word of each word of xdwop2 is placed in the lower half-word of
each word of dwdst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2.
63 48 47 32 31 16 15 0
v v v v
QRSTUVWXYZABCDEF qrstuvwxyzabcdef ABCDEFGHIJKLMNOP abcdefghijklmnop ← dst
Execution lsb16(src2_e)->lsb16(dst_e);
msb16(src1_e)->msb16(dst_e);
lsb16(src2_o)->lsb16(dst_o);
msb16(src1_o)->msb16(dst_o);
Delay Slots 0
Example A3 == 0xceea1111
A2 == 0xddbabacf
A1 == 0x2222bcae
A0 == 0xc0def00d
DPACKHL2 .L A3:A2,A1:A0,A15:A14
A15 == 0xceeabcae
A14 == 0xddbaf00d
4.122 DPACKL2
2-Way SIMD Pack 16 LSBs Into Upper and Lower Register Halves
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
31 30 29 28 27 23 22 18 17 13 12 11 10 9 6 5 2 1 0
0 0 0 1 dst src2 src1 x 11 opfield 1100 s p
5 5 5 2 4 4
Description The DPACKL2 instruction takes the low half-words from each of the words in dwop1
and xdwop2 and packs them both into dwdst. The lower half-word of each word in
dwop1 is placed in the upper half-word of each word in dwdst. The lower half-word of
each word in xdwop2 is placed in the lower half-word of each word in dwdst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2.
63 48 47 32 31 16 15 0
v v v v
QRSTUVWXYZABCDEF qrstuvwxyzabcdef ABCDEFGHIJKLMNOP abcdefghijklmnop ← dst
Execution lsb16(src2_e)->lsb16(dst_e);
lsb16(src1_e)->msb16(dst_e);
lsb16(src2_o)->lsb16(dst_o);
lsb16(src1_o)->msb16(dst_o);
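A minimal Python sketch of the Execution assignments above follows. Register pairs are modeled as (odd, even) tuples of 32-bit words; this convention and the name `dpackl2` are assumptions of the sketch.

```python
def dpackl2(src1_pair, src2_pair):
    """Model of DPACKL2 on (odd, even) register pairs of 32-bit words."""
    def pack2(w1, w2):
        # lsb16(w1) -> msb16(dst), lsb16(w2) -> lsb16(dst)
        return ((w1 & 0xFFFF) << 16) | (w2 & 0xFFFF)

    s1_o, s1_e = src1_pair
    s2_o, s2_e = src2_pair
    return pack2(s1_o, s2_o), pack2(s1_e, s2_e)
```

For src1 = (0xaaaabebe, 0xdeadbabf) and src2 = (0xccccbbaf, 0xc0def00d), the model returns (0xbebebbaf, 0xbabff00d).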
Delay Slots 0
Functional Unit Latency 1
Example A1 == 0xaaaabebe
A0 == 0xdeadbabf
A3 == 0xccccbbaf
A2 == 0xc0def00d
DPACKL2 .L A1:A0,A3:A2,A5:A4
A5 == 0xbebebbaf
A4 == 0xbabff00d
4.123 DPACKL4
2-Way SIMD Pack Four Low Bytes Into Four 8-bit Halfwords
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 2 1 0
5 5 5 7 3
Description The DPACKL4 instruction behaves precisely as two PACKL4 instructions executed in
parallel. For example, the instruction DPACKL4 A7:A6,A5:A4,A3:A2 is equivalent to:
PACKL4 A7,A5,A3
|| PACKL4 A6,A4,A2
Delay Slots 0
Example A7 == 0x55de55ad
A6 == 0x89012345
A3 == 0x55be55ef
A2 == 0x01234567
DPACKL4 .L A7:A6,A3:A2,A1:A0
A1 == 0xdeadbeef
A0 == 0x01452367
4.124 DPACKLH2
2-Way SIMD Pack 16 LSB Into Upper and 16 MSB Into Lower Register Halves
Opcode Opcode for .L Unit, 1/2 src — same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
4 5 5 5 7 3
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opcode 1 0 0 0 s p
4 5 5 5 6 4
Description The DPACKLH2 instruction takes the low half-words from each of the words in dwop1
and high half-words from each of the words in xdwop2 and packs them both into dwdst.
The lower half-word of each word in dwop1 is placed in the upper half-word of each word in dwdst.
The upper half-word of each word in xdwop2 is placed in the lower half-word of each
word in dwdst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2.
63 48 47 32 31 16 15 0
v v v v
QRSTUVWXYZABCDEF qrstuvwxyzabcdef ABCDEFGHIJKLMNOP abcdefghijklmnop ←dst
Execution msb16(src2_e)->lsb16(dst_e)
lsb16(src1_e)->msb16(dst_e)
msb16(src2_o)->lsb16(dst_o)
lsb16(src1_o)->msb16(dst_o)
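The Execution assignments above can be checked with a small Python model. As an assumption of this sketch, register pairs are (odd, even) tuples of 32-bit words, and the name `dpacklh2` is illustrative.

```python
def dpacklh2(src1_pair, src2_pair):
    """Model of DPACKLH2 on (odd, even) register pairs of 32-bit words."""
    def packlh2(w1, w2):
        # lsb16(w1) -> msb16(dst), msb16(w2) -> lsb16(dst)
        return ((w1 & 0xFFFF) << 16) | ((w2 >> 16) & 0xFFFF)

    s1_o, s1_e = src1_pair
    s2_o, s2_e = src2_pair
    return packlh2(s1_o, s2_o), packlh2(s1_e, s2_e)
```

With src1 = (0x11112222, 0x33334444) and src2 = (0x55556666, 0x77778888), the model produces (0x22225555, 0x44447777).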
Delay Slots 0
4.125 DPACKLH4
2-Way SIMD Pack High Bytes of Four Half-Words to Packed 8-bit, and Low Bytes Into
Packed 8-bits
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
4 5 5 5 7 3
Description The DPACKLH4 instruction behaves precisely as PACKH4 and PACKL4 instructions
executed in parallel on the same data set. For example, the instruction
DPACKLH4 A6, A4, A3:A2
Delay Slots 0
4.126 DPACKX2
Parallel PACKLH2 Operations
Compatibility
Opcode
31 30 29 28 27 24 23 22 18
0 0 0 1 dst 0 src2
4 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 0 1 1 1 1 0 s p
5 1 1 1
One PACKLH2 function of the DPACKX2 instruction takes the lower halfword from
src1 and the upper halfword from src2, and packs them both into dst_e. The lower
halfword of src1 is placed in the upper halfword of dst_e. The upper halfword of src2 is
placed in the lower halfword of dst_e.
The other PACKLH2 function of the DPACKX2 instruction takes the upper halfword
from src1 and the lower halfword from src2, and packs them both into dst_o. The upper
halfword of src1 is placed in the lower halfword of dst_o. The lower halfword of src2 is
placed in the upper halfword of dst_o.
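The two crossed packing operations described above can be sketched as follows. The model assumes 32-bit src1 and src2 and returns (dst_o, dst_e); the name `dpackx2` and the tuple convention are choices of this sketch.

```python
def dpackx2(src1, src2):
    """Model of DPACKX2: returns (dst_o, dst_e) as 32-bit unsigned values."""
    lo1, hi1 = src1 & 0xFFFF, (src1 >> 16) & 0xFFFF
    lo2, hi2 = src2 & 0xFFFF, (src2 >> 16) & 0xFFFF
    dst_e = (lo1 << 16) | hi2  # lower half of src1 over upper half of src2
    dst_o = (lo2 << 16) | hi1  # lower half of src2 over upper half of src1
    return dst_o, dst_e
```

With src1 = 87654321h and src2 = 12345678h, the model gives dst_e = 43211234h and dst_o = 56788765h.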
Delay Slots 0
Examples Example 1
Before instruction: A0 87654321h, A1 12345678h
After instruction: A2 43211234h, A3 56788765h
Example 2
Before instruction: A0 3FFF8000h, B0 40007777h
After instruction: A2 80004000h, A3 77773FFFh
4.127 DPINT
Convert Double-Precision Floating-Point Value to Integer
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 1 0 0 0 1 1 0 s p
1 1 1
Description The 64-bit double-precision value in src2 is converted to an integer and placed in dst.
The operand is read in one cycle by using the src2 port for the 32 MSBs and the src1 port
for the 32 LSBs.
Note—
1) If src2 is NaN, the maximum signed integer (7FFF FFFFh or
8000 0000h) is placed in dst and the INVAL bit is set.
2) If src2 is signed infinity or if overflow occurs, the maximum signed integer
(7FFF FFFFh or 8000 0000h) is placed in dst and the INEX and OVER bits are
set. Overflow occurs if src2 is greater than 2^31 − 1 or less than −2^31.
3) If src2 is denormalized, 0000 0000h is placed in dst and the INEX and DEN2
bits are set.
4) If rounding is performed, the INEX bit is set.
Execution if (cond) int(src2) → dst
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2_l, src2_h
Written dst
Unit in use .L
Delay Slots 3
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
4.128 DPSP
Convert Double-Precision Floating-Point Value to Single-Precision Floating-Point
Value
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 1 0 0 1 1 1 0 s p
1 1 1
Description The double-precision 64-bit value in src2 is converted to a single-precision value and
placed in dst. The operand is read in one cycle by using the src2 port for the 32 MSBs
and the src1 port for the 32 LSBs.
Note—
1) If rounding is performed, the INEX bit is set.
2) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
3) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
4) If src2 is a signed denormalized number, signed 0 is placed in dst and the INEX
and DEN2 bits are set.
5) If src2 is signed infinity, the result is signed infinity and the INFO bit is set.
6) If overflow occurs, the INEX and OVER bits are set and the results are set as
follows (LFPN is the largest floating-point number):
7) If underflow occurs, the INEX and UNDER bits are set and the results are set
as follows (SPFN is the smallest floating-point number):
else nop
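The double-to-single conversion can be explored on a host machine with Python's `struct` module, which rounds an IEEE double to single precision when packing with the `f` format. This is only a sketch: the function name `dpsp_bits` is an assumption, host rounding is round-to-nearest, and the status bits (INEX, OVER, UNDER, and so on) and the special-case results in the notes above are not modeled. In particular, `struct.pack` raises `OverflowError` for finite values that are too large for single precision rather than producing a saturated result.

```python
import struct


def dpsp_bits(bits64):
    """Convert a 64-bit double bit pattern to a 32-bit single bit pattern."""
    # Reinterpret the 64-bit pattern as a Python float (IEEE double).
    value = struct.unpack(">d", struct.pack(">Q", bits64))[0]
    # Packing as 'f' rounds to single precision; read the result back as bits.
    return struct.unpack(">I", struct.pack(">f", value))[0]
```

For the double 4021 3333h 3333 3333h (8.6), the model returns 4109 999Ah.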
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2_l, src2_h
Written dst
Unit in use .L
Delay Slots 3
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
4.129 DPTRUNC
Convert Double-Precision Floating-Point Value to Integer With Truncation
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 0 0 0 1 1 1 0 s p
1 1 1
Description The 64-bit double-precision value in src2 is converted to an integer and placed in dst.
This instruction operates like DPINT except that the rounding modes in the
floating-point adder configuration register (FADCR) are ignored; round toward zero
(truncate) is always used. The 64-bit operand is read in one cycle by using the src2 port
for the 32 MSBs and the src1 port for the 32 LSBs.
Note—
1) If src2 is NaN, the maximum signed integer (7FFF FFFFh or
8000 0000h) is placed in dst and the INVAL bit is set.
2) If src2 is signed infinity or if overflow occurs, the maximum signed integer
(7FFF FFFFh or 8000 0000h) is placed in dst and the INEX and OVER bits are
set. Overflow occurs if src2 is greater than 2^31 − 1 or less than −2^31.
3) If src2 is denormalized, 0000 0000h is placed in dst and the INEX and DEN2
bits are set.
4) If rounding is performed, the INEX bit is set.
Execution if (cond) int(src2) → dst
else nop
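The truncating conversion and its saturation cases can be sketched in Python. The function name `dptrunc` is an assumption of this sketch, the status bits are not modeled, and the result chosen for NaN inputs (7FFF FFFFh) is an assumption, since the note above allows either 7FFF FFFFh or 8000 0000h without stating which applies.

```python
import math
import struct

INT_MAX = 0x7FFFFFFF
INT_MIN = -0x80000000


def dptrunc(bits64):
    """Convert a 64-bit double bit pattern to a signed integer, rounding toward zero."""
    value = struct.unpack(">d", struct.pack(">Q", bits64))[0]
    if math.isnan(value):
        return INT_MAX  # assumption: the sign rule for NaN is not specified here
    if value >= 2**31:
        return INT_MAX  # positive overflow or +infinity saturates
    if value <= -(2**31):
        return INT_MIN  # negative overflow or -infinity saturates
    return math.trunc(value)  # denormalized inputs truncate to 0
```

For the double 4021 3333h 3333 3333h (8.6), the model returns 8.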
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2_l, src2_h
Written dst
Unit in use .L
Delay Slots 3
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
4.130 DSADD
2-Way SIMD Addition With Saturation, Packed Signed 32-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
4 5 5 5 6 4
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opfield 1 1 0 s p
4 5 5 5 7 3
Description The DSADD instruction performs two saturating 32-bit additions of the packed 32-bit
numbers contained in the two source register pairs. The addition results are returned
as two 32-bit results packed into dwdst.
63 32 31 0
high1 low1 ← dwop1
Delay Slots 0
Functional Unit Latency 1
Example A1 == 0x44444444
A0 == 0x7fffffff
A3 == 0xccccc444
A2 == 0x00000001
DSADD .L A1:A0,A3:A2,A15:A14
A15 == 0x11110888
A14 == 0x7fffffff
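The example above can be reproduced with a short Python model of two-lane saturating addition. Register pairs are represented as (odd, even) tuples of 32-bit words; this convention and the helper names are assumptions of the sketch.

```python
def to_signed32(word):
    """Interpret a 32-bit pattern as a signed integer."""
    word &= 0xFFFFFFFF
    return word - (1 << 32) if word & 0x80000000 else word


def sat32(value):
    """Clamp to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, value))


def dsadd(src1_pair, src2_pair):
    """Model of DSADD on (odd, even) pairs; returns 32-bit patterns."""
    return tuple(
        sat32(to_signed32(a) + to_signed32(b)) & 0xFFFFFFFF
        for a, b in zip(src1_pair, src2_pair)
    )
```

Here `dsadd((0x44444444, 0x7fffffff), (0xccccc444, 0x00000001))` returns `(0x11110888, 0x7fffffff)`: the even lane saturates at the positive limit.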
4.131 DSADD2
4-Way SIMD Addition with Saturation, Packed Signed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
4 5 5 5 2 4 4
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opfield 1 0 0 s p
4 5 5 5 7 3
Description The DSADD2 instruction performs four saturating 16-bit additions of the packed
signed 16-bit numbers contained in the two 64-bit source registers. The addition results
are returned as four signed 16-bit results packed into dst. Results are saturated to within
the range -2^15 to 2^15-1.
63 48 47 32 31 16 15 0
A B C D ←dwop1
W X Y Z ← xdwop2
v v v v
sat (A + W) sat (B + X) sat (C + Y) sat (D + Z) ← dwdst
Delay Slots 0
Example A1 == 0x44444444
A0 == 0x7fff0002
A3 == 0xccccc444
A2 == 0x7ff00005
DSADD2 .L A1:A0,A3:A2,A15:A14
A15 == 0x11100888
A14 == 0x7fff0007
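The four-lane behavior in this example can be modeled with a brief Python sketch. As before, register pairs are (odd, even) tuples of 32-bit words, and all names are assumptions of the sketch.

```python
def sadd16(a, b):
    """Saturating add of two 16-bit patterns; returns a 16-bit pattern."""
    sa = a - 0x10000 if a & 0x8000 else a
    sb = b - 0x10000 if b & 0x8000 else b
    total = max(-0x8000, min(0x7FFF, sa + sb))
    return total & 0xFFFF


def dsadd2(src1_pair, src2_pair):
    """Model of DSADD2 on (odd, even) pairs of 32-bit words."""
    def sadd2(w1, w2):
        hi = sadd16((w1 >> 16) & 0xFFFF, (w2 >> 16) & 0xFFFF)
        lo = sadd16(w1 & 0xFFFF, w2 & 0xFFFF)
        return (hi << 16) | lo

    return tuple(sadd2(a, b) for a, b in zip(src1_pair, src2_pair))
```

Here `dsadd2((0x44444444, 0x7fff0002), (0xccccc444, 0x7ff00005))` returns `(0x11100888, 0x7fff0007)`.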
4.132 DSHL
2-Way SIMD Shift Left, Packed Signed 32-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
4 5 5 5 6 4
Description The DSHL instruction shifts the two 32-bit values in xdwop1 to the left by op2 bits.
Both values are shifted by the same shift count. Bits shifted out past bit locations 31 and 63
are lost, and the vacated low-order bits are filled with zeros. Only the lower 6 bits of op2 are used for the
shift count. Bits 6-31 are ignored. A shift count between 32 and 63 will produce the
same result as a shift by 31.
63 32 31 0
ABCDEFGH IJKLMNOP QRSTUVWX YZabcdef abcdefgh ijklmnop qrstuvwx yzABCDEF ← xdwop1
(for op2 = 8) ← ←
← ←
← ←
IJKLMNOP QRSTUVWX YZabcdef 00000000 ijklmnop qrstuvwx yzABCDEF 00000000 ← dwdst
Delay Slots 0
A1 == 0x00000009
A0 == 0x419751A5
A2 == 0x00000020
DSHL .S A1:A0,A2,A15:A14
A15 == 0x00000000
A14 == 0x00000000
4.133 DSHL2
4-Way SIMD Shift Left, Packed Signed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
4 5 5 5 6 4
Description The DSHL2 instruction performs a left shift on packed 16-bit quantities. The values in
xdwop1 are viewed as four packed 16-bit quantities. The lower four bits of op2 or ucst5
are treated as a shift amount. The same shift amount is applied to all four input data.
The results are placed in a signed packed 16-bit format.
For each unsigned 16-bit quantity in xdwop1, the quantity is shifted left by the specified
number of bits. Bits shifted out of the most-significant bit of each 16-bit quantity are
discarded.
For correct operation bit 4 (the fifth bit) of the constant field (ucst5) or register field
(op2) must be set to 0.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
ABCDEFGH IJKLMNOP QRSTUVWX YZabcdef abcdefgh ijklmnop qrstuvwx yzABCDEF ← xdwop1
(for op2 = 8) ← ←
← ←
← ←
Delay Slots 0
Example A1 == 0xFEDC7A98
A0 == 0x1234fedc
DSHL2 .S A1:A0,4,A3:A2
A3 == 0xEDC0A980
A2 == 0x2340edc0
A1 == 0xFEDC7A98
A0 == 0x1234fedc
A1 == 0x4
DSHL2 .S A1:A0,A1,A3:A2
A3 == 0xEDC0A980
A2 == 0x2340edc0
A1 == 0xFEDC7A98
A0 == 0x1234fedc
A1 == 0x16
DSHL2 .S A1:A0,A1,A3:A2
A3 == 0x00000000
A2 == 0x00000000
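The documented case, in which bit 4 of the shift count is 0, can be sketched in Python as follows. Register pairs are (odd, even) tuples of 32-bit words, and the name `dshl2` is illustrative. As a deliberate choice of this sketch, counts with bit 4 set raise an error instead of guessing at the result.

```python
def dshl2(src_pair, count):
    """Model of DSHL2 for counts with bit 4 clear; pairs are (odd, even)."""
    if count & 0x10:
        raise ValueError("bit 4 of the shift count must be 0 for correct operation")
    shift = count & 0xF  # only the lower four bits are the shift amount

    def shl2(word):
        hi = (((word >> 16) & 0xFFFF) << shift) & 0xFFFF  # bits shifted out are discarded
        lo = ((word & 0xFFFF) << shift) & 0xFFFF
        return (hi << 16) | lo

    return tuple(shl2(w) for w in src_pair)
```

For the first example above, `dshl2((0xFEDC7A98, 0x1234FEDC), 4)` returns `(0xEDC0A980, 0x2340EDC0)`.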
4.134 DSHR
2-Way SIMD Shift Right, Packed Signed 32-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
4 5 5 5 6 4
Description The DSHR instruction shifts the two signed 32-bit values in xdwop1 to the right by op2
bits. Both values are shifted by the same shift count. Bits shifted past bit locations 0 and
32 are lost, and the result is sign extended. When a register is used, the five LSBs specify
the shift amount and valid values are 0-31 for correct operations. When an immediate
value is used, valid shift amounts are 0-31 for correct operation.
63 32 31 0
ABCDEFGH IJKLMNOP QRSTUVWX YZabcdef abcdefgh ijklmnop qrstuvwx yzABCDEF ← xdwop1
(for op2 = 8) → →
→ →
→ →
AAAAAAAA ABCDEFGH IJKLMNOP QRSTUVWX aaaaaaaa abcdefgh ijklmnop qrstuvwx ← dwdst
Delay Slots 0
A1 == 0x1234abff
A0 == 0xff333415
A2 == 0x00000020
DSHR .S A1:A0,A2,A15:A14
A15 == 0x00000000
A14 == 0xffffffff
4.135 DSHR2
4-Way SIMD Shift Right, Packed Signed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
4 5 5 5 2 4 4
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opcode 1 0 0 0 s p
4 5 5 5 6 4
Description The DSHR2 instruction performs an arithmetic shift right on signed packed 16-bit
quantities. The values in xdwop1 are viewed as four signed packed 16-bit quantities.
The lower four bits of op2 or ucst5 are treated as a shift amount. The same shift amount
is applied to all four input data. The results are placed in a signed packed 16-bit format.
For each signed 16-bit quantity in xdwop1, the quantity is shifted right by the specified
number of bits. The shifted quantity is sign-extended, and placed in the corresponding
position in dst. Bits shifted out of the least-significant bit of each signed 16-bit quantity
are discarded.
For correct operation bit 4 (the fifth bit) of the constant field (ucst5) or register field
(op2) must be set to 0.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
(for op2 = 8) → →
→ →
→ →
Delay Slots 0
Functional Unit Latency 1
Example A1 == 0xFEDC7A98
A0 == 0x1234fedc
DSHR2 .S A1:A0,4,A15:A14
A15 == 0xFFED07A9
A14 == 0x0123ffed
A1 == 0xFEDC7A98
A0 == 0x1234fedc
A2 == 0x00000010
DSHR2 .S A1:A0,A2,A15:A14
A15 == 0xffff0000
A14 == 0x0000ffff
A1 == 0xFEDC7A98
A0 == 0x1234fedc
A2 == 0x00000044
DSHR2 .S A1:A0,A2,A15:A14
A15 == 0xffed07a9
A14 == 0x0123ffed
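The first and third examples above (shift counts 4 and 0x44) can be reproduced with a short Python model of the arithmetic shift. Register pairs are (odd, even) tuples of 32-bit words; the name `dshr2` is an assumption. Counts with bit 4 set are rejected in this sketch, since the description states that bit 4 must be 0 for correct operation.

```python
def dshr2(src_pair, count):
    """Model of DSHR2 for counts with bit 4 clear; pairs are (odd, even)."""
    if count & 0x10:
        raise ValueError("bit 4 of the shift count must be 0 for correct operation")
    shift = count & 0xF  # only the lower four bits are the shift amount

    def shr16(half):
        signed = half - 0x10000 if half & 0x8000 else half
        return (signed >> shift) & 0xFFFF  # Python's >> on a negative int sign-extends

    def shr2(word):
        return (shr16((word >> 16) & 0xFFFF) << 16) | shr16(word & 0xFFFF)

    return tuple(shr2(w) for w in src_pair)
```

Both `dshr2((0xFEDC7A98, 0x1234FEDC), 4)` and `dshr2((0xFEDC7A98, 0x1234FEDC), 0x44)` return `(0xFFED07A9, 0x0123FFED)`.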
4.136 DSHRU
2-Way SIMD Shift Right, Packed Unsigned 32-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
4 5 5 5 6 4
Description The DSHRU instruction shifts the two unsigned 32-bit values in xdwop1 to the right by
op2 bits. Both values are shifted by the same shift count. Bits shifted past bit locations
0 and 32 are lost, and the result is zero extended. When a register is used, the five LSBs
specify the shift amount and valid values are 0-31 for correct operations. When an
immediate value is used, valid shift amounts are 0-31 for valid operations.
63 32 31 0
ABCDEFGH IJKLMNOP QRSTUVWX YZabcdef abcdefgh ijklmnop qrstuvwx yzABCDEF ← xdwop1
(for op2 = 8) → →
→ →
→ →
00000000 ABCDEFGH IJKLMNOP QRSTUVWX 00000000 abcdefgh ijklmnop qrstuvwx ← dwdst
Delay Slots 0
A1 == 0x1234abff
A0 == 0xff333415
A2 == 0x50
DSHRU .S A1:A0,A2,A15:A14
A15 == 0x00001234
A14 == 0x0000ff33
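The example above uses a shift count of 0x50, whose five LSBs are 0x10 (16). A minimal Python sketch of this behavior follows; register pairs are (odd, even) tuples of 32-bit words, and the name `dshru` is an assumption of the sketch.

```python
def dshru(src_pair, count):
    """Model of DSHRU: logical right shift of each 32-bit word in an (odd, even) pair."""
    shift = count & 0x1F  # the five LSBs specify the shift amount
    return tuple((w & 0xFFFFFFFF) >> shift for w in src_pair)
```

Here `dshru((0x1234abff, 0xff333415), 0x50)` returns `(0x00001234, 0x0000ff33)`.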
4.137 DSHRU2
4-Way SIMD Shift Right, Packed Unsigned 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
4 5 5 5 2 4 4
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opfield 1 0 0 0 s p
4 5 5 5 6 4
Description The DSHRU2 instruction performs a logical shift right on unsigned packed 16-bit quantities.
The values in xdwop1 are viewed as four unsigned packed 16-bit quantities. The lower
four bits of op2 or ucst5 are treated as a shift amount. The same shift amount is applied
to all four input data. The results are placed in an unsigned packed 16-bit format.
For each unsigned 16-bit quantity in xdwop1, the quantity is shifted right by the
specified number of bits. The shifted quantity is zero-extended, and placed in the
corresponding position in dst. Bits shifted out of the least-significant bit of each unsigned
16-bit quantity are discarded.
For correct operation bit 4 (the fifth bit) of the constant field (ucst5) or register field
(op2) must be set to 0.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
ABCDEFGH IJKLMNOP QRSTUVWX YZabcdef abcdefgh ijklmnop qrstuvwx yzABCDEF ← xdwop1
(for op2 = 8) → →
→ →
→ →
Delay Slots 0
Example A1 == 0xFEDC7A98
A0 == 0x1234fedc
DSHRU2 .S A1:A0,4,A15:A14
A15 == 0x0FED07A9
A14 == 0x01230fed
A1 == 0xFEDC7A98
A0 == 0x1234fedc
A2 == 0x00000004
DSHRU2 .S A1:A0,A2,A15:A14
A15 == 0x0FED07A9
A14 == 0x01230fed
A1 == 0xFEDC7A98
A0 == 0x1234fedc
A2 == 0x00000044
DSHRU2 .S A1:A0,A2,A15:A14
A15 == 0x0FED07A9
A14 == 0x01230fed
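The three examples above all reduce to a shift amount of 4 (the lower four bits of 4, 0x00000004, and 0x00000044). They can be reproduced with the following Python sketch; register pairs are (odd, even) tuples of 32-bit words, the name `dshru2` is an assumption, and counts with bit 4 set are rejected rather than modeled.

```python
def dshru2(src_pair, count):
    """Model of DSHRU2 for counts with bit 4 clear; pairs are (odd, even)."""
    if count & 0x10:
        raise ValueError("bit 4 of the shift count must be 0 for correct operation")
    shift = count & 0xF  # only the lower four bits are the shift amount

    def shru2(word):
        hi = ((word >> 16) & 0xFFFF) >> shift  # zero-filled from the left
        lo = (word & 0xFFFF) >> shift
        return (hi << 16) | lo

    return tuple(shru2(w) for w in src_pair)
```

Here `dshru2((0xFEDC7A98, 0x1234FEDC), 4)` returns `(0x0FED07A9, 0x01230FED)`.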
4.138 DSMPY2
4-Way SIMD Multiply Signed by Signed With Left Shift and Saturation, Packed Signed
16-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
4 5 5 5 5 4
Description The DSMPY2 instruction performs 16-bit multiplication between signed packed 16-bit
quantities, with an additional left-shift and saturate. The values in dwop1 and xdwop2
are treated as signed packed 16-bit quantities. The four 32-bit results are placed in two
64-bit register pairs.
The DSMPY2 instruction produces four 16 x 16 products. Each product is shifted left
by one, and if the left-shifted result is equal to 0x80000000, the output value is saturated
to 0x7FFFFFFF. If any product saturates, the SAT bit is set in the CSR on the cycle the
result is written. If no product saturates, the SAT bit is left unaffected.
Delay Slots 3
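The multiply, shift, and saturate steps described above can be sketched in Python. Source pairs are (odd, even) tuples of 32-bit words, and the result is returned as four 32-bit patterns ordered from the highest to the lowest destination register, together with a flag indicating whether any lane saturated. The names and this return convention are assumptions of the sketch; the lane-to-register ordering follows the examples in this section.

```python
def s16(half):
    """Interpret a 16-bit pattern as a signed integer."""
    return half - 0x10000 if half & 0x8000 else half


def dsmpy2(src1_pair, src2_pair):
    """Model of DSMPY2; returns ((d3, d2, d1, d0), sat_flag)."""
    results = []
    saturated = False
    for w1, w2 in zip(src1_pair, src2_pair):  # odd word first, then even
        for shift in (16, 0):                 # upper halfword, then lower
            product = s16((w1 >> shift) & 0xFFFF) * s16((w2 >> shift) & 0xFFFF)
            product <<= 1
            if product == 0x80000000:         # only 0x8000 x 0x8000 reaches this value
                product = 0x7FFFFFFF
                saturated = True
            results.append(product & 0xFFFFFFFF)
    return tuple(results), saturated
```

For src1 = (0x80008000, 0x80008000) and src2 = (0x80008000, 0x80008000), every lane saturates to 0x7FFFFFFF and the flag is set, which corresponds to the SAT bit in the CSR.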
A8 == 0x00000002
A13 == 0xFFFFFFFF
A12 == 0xFFFFFFFF
A23 == 0xAAAAAAAA
A22 == 0xAAAAAAAA
DSMPY2 .M A13:A12,A23:A22,A11:A10:A9:A8
A11 == 0x0000AAAC
A10 == 0x0000AAAC
A9 == 0x0000AAAC
A8 == 0x0000AAAC
A13 == 0x7FFF7FFF
A12 == 0x7FFF7FFF
A23 == 0xAAAAAAAA
A22 == 0xAAAAAAAA
DSMPY2 .M A13:A12,A23:A22,A11:A10:A9:A8
A11 == 0xAAAAAAAC
A10 == 0xAAAAAAAC
A9 == 0xAAAAAAAC
A8 == 0xAAAAAAAC
A13 == 0x80018001
A12 == 0x80018001
A23 == 0x80018001
A22 == 0x80018001
DSMPY2 .M A13:A12,A23:A22,A11:A10:A9:A8
A11 == 0x7ffe0002
A10 == 0x7ffe0002
A9 == 0x7ffe0002
A8 == 0x7ffe0002
A13 == 0x7fff7fff
A12 == 0x7fff7fff
A23 == 0x7fff7fff
A22 == 0x7fff7fff
DSMPY2 .M A13:A12,A23:A22,A11:A10:A9:A8
A11 == 0x7ffe0002
A10 == 0x7ffe0002
A9 == 0x7ffe0002
A8 == 0x7ffe0002
A13 == 0xc333c333
A12 == 0xc333c333
A23 == 0x3ccdc333
A22 == 0x3ccdc333
DSMPY2 .M A13:A12,A23:A22,A11:A10:A9:A8
A11 == 0xe31e87ae
A10 == 0x1ce17852
A9 == 0xe31e87ae
A8 == 0x1ce17852
A13 == 0x00000001
A12 == 0x00000001
A23 == 0x80008000
A22 == 0x80008000
DSMPY2 .M A13:A12,A23:A22,A11:A10:A9:A8
A11 == 0x00000000
A10 == 0xffff0000
A9 == 0x00000000
A8 == 0xffff0000
CSR= 0x00000000
A13 == 0x80008000
A12 == 0x80008000
A23 == 0x80008000
A22 == 0x80008000
DSMPY2 .M A13:A12,A23:A22,A11:A10:A9:A8
A11 == 0x7FFFFFFF
A10 == 0x7FFFFFFF
A9 == 0x7FFFFFFF
A8 == 0x7FFFFFFF
CSR= 0x00000200
CSR == 0x00000000
B13 == 0xFFFFFFFF
B12 == 0xFFFFFFFF
B23 == 0x00010001
B22 == 0x00010001
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0xFFFFFFFE
B10 == 0xFFFFFFFE
B9 == 0xFFFFFFFE
B8 == 0xFFFFFFFE
B13 == 0xFFFFFFFF
B12 == 0xFFFFFFFF
B23 == 0xFFFFFFFF
B22 == 0xFFFFFFFF
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0x00000002
B10 == 0x00000002
B9 == 0x00000002
B8 == 0x00000002
B13 == 0xFFFFFFFF
B12 == 0xFFFFFFFF
B23 == 0xAAAAAAAA
B22 == 0xAAAAAAAA
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0x0000AAAC
B10 == 0x0000AAAC
B9 == 0x0000AAAC
B8 == 0x0000AAAC
B13 == 0x7FFF7FFF
B12 == 0x7FFF7FFF
B23 == 0xAAAAAAAA
B22 == 0xAAAAAAAA
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0xAAAAAAAC
B10 == 0xAAAAAAAC
B9 == 0xAAAAAAAC
B8 == 0xAAAAAAAC
B13 == 0x80018001
B12 == 0x80018001
B23 == 0x80018001
B22 == 0x80018001
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0x7ffe0002
B10 == 0x7ffe0002
B9 == 0x7ffe0002
B8 == 0x7ffe0002
B13 == 0x7fff7fff
B12 == 0x7fff7fff
B23 == 0x7fff7fff
B22 == 0x7fff7fff
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0x7ffe0002
B10 == 0x7ffe0002
B9 == 0x7ffe0002
B8 == 0x7ffe0002
B13 == 0xc333c333
B12 == 0xc333c333
B23 == 0x3ccdc333
B22 == 0x3ccdc333
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0xe31e87ae
B10 == 0x1ce17852
B9 == 0xe31e87ae
B8 == 0x1ce17852
B13 == 0x00000001
B12 == 0x00000001
B23 == 0x80008000
B22 == 0x80008000
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0x00000000
B10 == 0xffff0000
B9 == 0x00000000
B8 == 0xffff0000
CSR= 0x00000000
B13 == 0x80008000
B12 == 0x80008000
B23 == 0x80008000
B22 == 0x80008000
DSMPY2 .M B13:B12,B23:B22,B11:B10:B9:B8
B11 == 0x7FFFFFFF
B10 == 0x7FFFFFFF
B9 == 0x7FFFFFFF
B8 == 0x7FFFFFFF
CSR= 0x00000200
4.139 DSPACKU4
2-Way SIMD Saturate and Pack Into Unsigned Packed 8-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
4 5 5 5 2 4 4
Delay Slots 0
4.140 DSPINT
2-Way SIMD Convert Single Precision Floating Point to Signed 32-bit Integer
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5 10
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src opfield x 1 1 1 1 0 0 1 0 0 0 s p
3 5 5 5 10
Description Converts two single precision values packed in src2 to two integer values packed in dst.
Execution if(cond){
int(src2_e) -> dst_e
int(src2_o) -> dst_o
}
else nop
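The two-lane conversion can be approximated on a host machine with the Python sketch below. Several points are assumptions of this sketch rather than statements from this guide: rounding is round-to-nearest-even (Python's `round`), whereas the device may use a different configured rounding mode; NaN inputs are mapped to 0x7FFFFFFF; out-of-range values saturate; and status bits are not modeled. Register pairs are (odd, even) tuples of 32-bit patterns.

```python
import math
import struct


def spint(bits32):
    """Convert one single-precision bit pattern to a 32-bit integer pattern."""
    value = struct.unpack(">f", struct.pack(">I", bits32))[0]
    if math.isnan(value):
        return 0x7FFFFFFF  # assumption of this sketch
    if value >= 2**31:
        return 0x7FFFFFFF  # positive overflow or +infinity
    if value <= -(2**31):
        return 0x80000000  # negative overflow or -infinity
    return round(value) & 0xFFFFFFFF  # round-to-nearest-even (assumption)


def dspint(src_pair):
    """Model of DSPINT on an (odd, even) pair of single-precision bit patterns."""
    return tuple(spint(w) for w in src_pair)
```

Under these assumptions, `dspint((0x4109999a, 0x4effffff))` returns `(0x00000009, 0x7fffff80)`.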
FADCR= 0x00000092
FADCR== 0x00000200
A7 == 0x4109999a
A6 == 0x7fc00000
DSPINT .L A7:A6,A3:A2 ; QNaN
A3 == 0x00000008
A2 == 0x7fffffff
FADCR= 0x00000292
FADCR== 0x00000400
A7 == 0x4109999a
A6 == 0x4effffff
DSPINT .L A7:A6,A3:A2
A3 == 0x00000009
A2 == 0x7fffff80
FADCR= 0x00000480
FADCR== 0x00000600
4.141 DSPINTH
2-Way SIMD Convert Single Precision Floating Point to Signed 16-bit Integer
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5 10
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src opfield x 1 1 1 1 0 0 1 0 0 0 s p
3 5 5 5 10
Description Converts two single precision floating point values in src2_e and src2_o to two signed
16-bit integer values packed in dst.
Execution if(cond){
int(src2_e) -> lsb16(dst)
int(src2_o) -> msb16(dst)
}
else nop
FADCR== 0x00000200
4.142 DSSUB
2-Way SIMD Saturating Subtract, Packed Signed 32-bit
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
4 5 5 5 7 3
Description The DSSUB instruction performs two saturating 32-bit subtracts of the packed 32-bit
numbers contained in the two source register pairs. The subtraction results are
saturated to the range -2^31 to 2^31 - 1 and returned as two 32-bit results packed into dwdst.
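The saturating behavior described above can be sketched in C as follows. This is only an illustrative host-side model, not TI code: the names ssub32 and dssub_model are hypothetical, and a register pair is represented as a single 64-bit value (odd register in the upper word).

```c
#include <stdint.h>

/* One lane: compute a - b in 64-bit arithmetic, then clamp the
   difference to the signed 32-bit range. */
int32_t ssub32(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}

/* Hypothetical model of DSSUB: dwop1 - dwop2, lane by lane.
   The uint32_t -> int32_t casts assume two's complement wrap,
   which mainstream compilers provide. */
uint64_t dssub_model(uint64_t dwop1, uint64_t dwop2)
{
    int32_t hi = ssub32((int32_t)(uint32_t)(dwop1 >> 32),
                        (int32_t)(uint32_t)(dwop2 >> 32));
    int32_t lo = ssub32((int32_t)(uint32_t)dwop1,
                        (int32_t)(uint32_t)dwop2);
    return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}
```

The inputs used for checking this sketch are taken from the register dumps in the examples of this section, with the first source pair as dwop1 and the second as dwop2.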
63 32 31 0
high1 low1 ← dwop1
v v
A5 == 0xfffffff0
A4 == 0x7ffffff4
A9 == 0x7ffffff4
A8 == 0xffffff00
DSSUB . A5:A4,A9:A8,A1:A0
A1 == 0x80000000
A0 == 0x7fffffff
A5 == 0x7ffffff4
A4 == 0x7fffffff
A9 == 0x0ffffff0
A8 == 0xffffffff
DSSUB . A5:A4,A9:A8,A1:A0
A1 == 0x70000004
A0 == 0x7fffffff
A5 == 0x7ffffff0
A4 == 0x00000000
A9 == 0x80000000
A8 == 0xffffffff
DSSUB . A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0x00000001
A5 == 0x00000000
A4 == 0x00000000
A9 == 0x80000000
A8 == 0x7fffffff
DSSUB . A5:A4,A9:A8,A1:A0
A1 == 0x7fffffff
A0 == 0x80000001
A5 == 0xffffffff
A4 == 0xfffffffe
A9 == 0x7fffffff
A8 == 0x7fffffff
DSSUB . A5:A4,A9:A8,A1:A0
A1 == 0x80000000
A0 == 0x80000000
A5 == 0xfffffff0
A4 == 0x7ffffff4
A9 == 0x7ffffff4
A8 == 0xffffff00
DSSUB . A5:A4,A9:A8,A1:A0
A1 == 0x80000000
A0 == 0x7fffffff
A5 == 0xffffffff
A4 == 0xffffffff
A9 == 0xffffffff
A8 == 0xfffffffe
DSSUB . A5:A4,A9:A8,A1:A0
A1 == 0x00000000
A0 == 0x00000001
4.143 DSSUB2
4-Way SIMD Saturating Subtract, Packed Signed 16-bit
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
4 5 5 5 7 3
Description The DSSUB2 instruction performs four saturating 16-bit subtractions of the packed
signed 16-bit numbers contained in the two 64-bit source registers. The subtraction
results are returned as four signed 16-bit results packed into dst. Results are saturated
to within the range -2^15 to 2^15 - 1.
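A minimal C sketch of the four-lane saturating subtract, assuming a register pair is modeled as one 64-bit value. The names ssub16 and dssub2_model are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* One lane: subtract in 32-bit arithmetic, then clamp to the
   signed 16-bit range. */
int16_t ssub16(int16_t a, int16_t b)
{
    int32_t d = (int32_t)a - (int32_t)b;
    if (d > INT16_MAX) return INT16_MAX;
    if (d < INT16_MIN) return INT16_MIN;
    return (int16_t)d;
}

/* Hypothetical model of DSSUB2: four independent 16-bit lanes. */
uint64_t dssub2_model(uint64_t dwop1, uint64_t dwop2)
{
    uint64_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 16 * lane;
        int16_t a = (int16_t)(uint16_t)(dwop1 >> shift);
        int16_t b = (int16_t)(uint16_t)(dwop2 >> shift);
        dst |= (uint64_t)(uint16_t)ssub16(a, b) << shift;
    }
    return dst;
}
```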
63 48 47 32 31 16 15 0
A B C D ←dwop1
W X Y Z ← xdwop2
v v v v
Delay Slots 0
A1 == 0xffffffff
A0 == 0xffffffff
A3 == 0xffffffff
A2 == 0xfffffffe
DSSUB2 .L A1:A0,A3:A2,A15:A14
A15 == 0x00000000
A14 == 0x00000001
A1 == 0x80008000
A0 == 0x80008000
A3 == 0x00010001
A2 == 0xffffffff
DSSUB2 .L A1:A0,A3:A2,A15:A14
A15 == 0x80008000
A14 == 0x80018001
4.144 DSUB
2-Way SIMD Subtract, Packed Signed 32-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
4 5 5 5 6 4
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opfield 1 1 0 s p
4 5 5 5 7 3
Description The DSUB instruction performs two 32-bit subtractions of the packed 32-bit numbers
contained in the two source register pairs. The subtraction results are returned as two
32-bit results packed into dwdst.
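For comparison with the saturating form, the non-saturating subtract can be modeled with unsigned arithmetic, which wraps modulo 2^32 in each lane. The function name dsub_model is an assumption made for this sketch.

```c
#include <stdint.h>

/* Hypothetical model of DSUB: two independent 32-bit subtractions
   that wrap on overflow (no saturation). */
uint64_t dsub_model(uint64_t dwop1, uint64_t dwop2)
{
    uint32_t hi = (uint32_t)(dwop1 >> 32) - (uint32_t)(dwop2 >> 32);
    uint32_t lo = (uint32_t)dwop1 - (uint32_t)dwop2;
    return ((uint64_t)hi << 32) | lo;
}
```

Note that 0x7fffffff - 0xffffffff wraps to 0x80000000 here, whereas a saturating subtract would clamp the result to 0x7fffffff.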
63 32 31 0
high1 low1 ← dwop1
v v
A1 == 0x7fffffff
A0 == 0x7ffffffc
A3 == 0x7ffffffe
A2 == 0xfffffffe
DSUB .L A1:A0,A3:A2,A15:A14
A15 == 0x00000001
A14 == 0x7ffffffe
A1 == 0x80000001
A0 == 0x80000000
A3 == 0xfffffffc
A2 == 0x00000001
DSUB .L A1:A0,A3:A2,A15:A14
A15 == 0x80000005
A14 == 0x7fffffff
A1 == 0x7fffffff
A0 == 0xffffffff
A3 == 0xffffffff
A2 == 0x7fffffff
DSUB .L A1:A0,A3:A2,A15:A14
A15 == 0x80000000
A14 == 0x80000000
4.145 DSUB2
4-Way SIMD Subtract, Packed Signed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
4 5 5 5 6 4
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opfield 1 1 0 s p
4 5 5 5 7 3
Description The DSUB2 instruction performs four 16-bit subtractions of the packed 16-bit
numbers contained in the two 64-bit wide source registers. The subtraction results are
returned as four 16-bit results packed into dst.
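The lane-wise behavior can be sketched in C as below, assuming each halfword wraps modulo 2^16 and no borrow crosses a lane boundary. The name dsub2_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of DSUB2: four independent 16-bit subtractions. */
uint64_t dsub2_model(uint64_t dwop1, uint64_t dwop2)
{
    uint64_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 16 * lane;
        uint16_t a = (uint16_t)(dwop1 >> shift);
        uint16_t b = (uint16_t)(dwop2 >> shift);
        uint16_t d = (uint16_t)(a - b);   /* wraps modulo 2^16 */
        dst |= (uint64_t)d << shift;
    }
    return dst;
}
```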
63 48 47 32 31 16 15 0
A B C D ← dwop1
W X Y Z ← xdwop2
v v v v
Delay Slots 0
Example A1 == 0x44444444
A0 == 0x7fff7fff
A3 == 0xccccc444
A2 == 0x00020005
DSUB2 .L A1:A0,A3:A2,A15:A14
A15 == 0x77788000
A14 == 0x7ffd7ffa
4.146 DSUBSP
2-Way SIMD Subtract, Packed Single Precision Floating Point
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
4 5 5 5 7 3
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x opfield 1 0 0 0 s p
4 5 5 5 6 4
Description Performs a SIMD single-precision floating point Subtract on the pairs of numbers in
dwop1 and xdwop2. The following are equivalent:
DSUBSP A1:A0, A3:A2, A5:A4
and
FSUBSP A1, A3, A5
FSUBSP A0, A2, A4
Execution if(cond) {
src1_e - src2_e -> dst_e
src1_o - src2_o -> dst_o
}
else nop
Delay Slots 2
FADCR= 0x00000080
FADCR== 0x02000000
B3 == 0x40784517
B2 == 0x7fc00000
B7 == 0x42aeab2d
B6 == 0x42300000
DSUBSP .L B3:B2,B7:B6,B1:B0 ; low:QNaN
B1 == 0xc2a6e904
B0 == 0x7fffffff
FADCR= 0x02810000
FADCR== 0x00000400
A5 == 0x40784517
A4 == 0x41200000
A7 == 0x42aeab2d
A6 == 0x7f900000
DSUBSP .L A5:A4,A7:A6,A1:A0 ; low:SNaN
A1 == 0xc2a6e904
A0 == 0x7fffffff
FADCR= 0x00000492
FADCR== 0x06000000
B3 == 0x40784517
B2 == 0x7f800000
B7 == 0x42aeab2d
B6 == 0x7f800000
DSUBSP .L B3:B2,B7:B6,B1:B0 ; low: Inf-Inf=NaN_out
B1 == 0xc2a6e905
B0 == 0x7fffffff
FADCR= 0x06900000
FADCR== 0x06000000
B3 == 0x40784517
B2 == 0x7f800000
B7 == 0x42aeab2d
B6 == 0x42300000
DSUBSP .L B3:B2,B7:B6,B1:B0 ; low: Inf-xxx=Inf
B1 == 0xc2a6e905
B0 == 0x7f800000
FADCR= 0x06a00000
FADCR== 0x00000000
A5 == 0x7f7fffff
A4 == 0xff7fffff
A7 == 0xff780123
A6 == 0x7f780123
DSUBSP .L A5:A4,A7:A6,A1:A0 ; +Inf ; -Inf
A1 == 0x7f800000
A0 == 0xff800000
FADCR= 0x000000e0
FADCR== 0x00000200
A5 == 0x7f7fffff
A4 == 0xff7fffff
A7 == 0xff780123
A6 == 0x7f780123
DSUBSP .L A5:A4,A7:A6,A1:A0 ; +LFPN ; -LFPN
A1 == 0x7f7fffff
A0 == 0xff7fffff
FADCR= 0x000002c0
FADCR== 0x00000400
A5 == 0x7f7fffff
A4 == 0xff7fffff
A7 == 0xff780123
A6 == 0x7f780123
DSUBSP .L A5:A4,A7:A6,A1:A0 ; +Inf ; -LFPN
A1 == 0x7f800000
A0 == 0xff7fffff
FADCR= 0x000004e0
FADCR== 0x00000600
A5 == 0x7f7fffff
A4 == 0xff7fffff
A7 == 0xff780123
A6 == 0x7f780123
DSUBSP .L A5:A4,A7:A6,A1:A0 ; +LFPN ;-Inf
A1 == 0x7f7fffff
A0 == 0xff800000
FADCR= 0x000006e0
FADCR== 0x00000000
A5 == 0x008d8000
A4 == 0x808d8000
A7 == 0x008d0000
A6 == 0x808d0000
DSUBSP .L A5:A4,A7:A6,A1:A0 ; +0 ; -0
A1 == 0x00000000
A0 == 0x80000000
FADCR= 0x00000180
FADCR== 0x00000400
A5 == 0x008d8000
A4 == 0x808d8000
A7 == 0x008d0000
A6 == 0x808d0000
DSUBSP .L A5:A4,A7:A6,A1:A0 ; +SFPN ;-0
A1 == 0x00800000
A0 == 0x80000000
FADCR= 0x00000580
FADCR== 0x00000600
A5 == 0x008d8000
A4 == 0x808d8000
A7 == 0x008d0000
A6 == 0x808d0000
DSUBSP .L A5:A4,A7:A6,A1:A0 ; +0 ; -SFPN
A1 == 0x00000000
A0 == 0x80800000
FADCR= 0x00000780
4.147 DXPND2
Expand Bits to Packed 16-bit Masks
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5 10
Description The DXPND2 instruction reads the four least-significant bits of xop and expands
them into four halfword masks. Bits 1 and 0 are replicated to the upper and lower
halfwords of the even result register, and bits 3 and 2 are replicated to the upper and
lower halfwords of the odd result register, respectively. Bits 31 through 4 of xop are
explicitly ignored and may be non-zero.
This instruction is useful when combined with the output of DCMPGT2, for
generating a mask which corresponds to the individual halfword positions that were
compared. Such a mask may subsequently be used with ANDN, AND and OR
instructions to perform compositing or other multiplexing operations.
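The expansion and the compositing use described above can be sketched in C. This is a host-side illustration only: dxpnd2_model and select_by_mask are hypothetical names, and the 64-bit return value stands for the odd:even result register pair.

```c
#include <stdint.h>

/* Hypothetical model of DXPND2: bit n of xop (n = 0..3) becomes an
   all-ones or all-zeros mask in halfword n of the 64-bit result. */
uint64_t dxpnd2_model(uint32_t xop)
{
    uint64_t dst = 0;
    for (int bit = 0; bit < 4; bit++) {
        if (xop & (1u << bit))
            dst |= (uint64_t)0xFFFFu << (16 * bit);
    }
    return dst;
}

/* Compositing with the mask: take halfwords of a where the mask is
   set and halfwords of b elsewhere (the AND / ANDN / OR pattern). */
uint64_t select_by_mask(uint64_t mask, uint64_t a, uint64_t b)
{
    return (a & mask) | (b & ~mask);
}
```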
63 48 47 32 31 16 15 4 3 2 1 0
xxxxxxxxxxxxxxxx xxxxxxxxxxxx A B C D ← xop
v v
Execution if(cond) {
}
else nop
Delay Slots 1
Example A0 == 0x00000000
DXPND2 .M A0,A3:A2
A3 == 0x00000000
A2 == 0x00000000
A0 == 0x00000001
DXPND2 .M A0,A3:A2
A3 == 0x00000000
A2 == 0x0000ffff
A0 == 0x00000002
DXPND2 .M A0,A3:A2
A3 == 0x00000000
A2 == 0xffff0000
A0 == 0x00000003
DXPND2 .M A0,A3:A2
A3 == 0x00000000
A2 == 0xffffffff
A0 == 0x1234567d
DXPND2 .M A0,A3:A2
A3 == 0xffffffff
A2 == 0x0000ffff
A0 == 0x89abcdee
DXPND2 .M A0,A3:A2
A3 == 0xffffffff
A2 == 0xffff0000
4.148 DXPND4
Expand Bits to Packed 8-bit Masks
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5 10
Description The DXPND4 instruction reads the 8 least-significant bits of xop and expands them
into an 8 byte mask. Bits 7 through 0 are replicated to bytes 7 through 0 of the result.
Bits 31 through 8 of xop are explicitly ignored and may be non-zero.
This instruction is useful when combined with the output of DCMPGTU4, for
generating a mask which corresponds to the individual byte positions that were
compared. Such a mask may subsequently be used with ANDN, AND and OR
instructions to perform compositing or other multiplexing operations.
Because DXPND4 only examines the eight LSBs of xop, it is possible to store a large bit
mask in a single 32-bit word, and expand it using multiple SHR and DXPND4 pairs.
This can be useful for expanding a packed 1-bit/pixel bitmap into full 8-bit pixels.
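The byte expansion, and the shift-and-expand pattern for a packed 1-bit/pixel bitmap, can be sketched in C as follows. The names dxpnd4_model and expand_bitmap_word are hypothetical, and the 64-bit value stands for the odd:even result register pair.

```c
#include <stdint.h>

/* Hypothetical model of DXPND4: bit n of xop (n = 0..7) becomes
   0xFF or 0x00 in byte n of the 64-bit result. */
uint64_t dxpnd4_model(uint32_t xop)
{
    uint64_t dst = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (xop & (1u << bit))
            dst |= (uint64_t)0xFFu << (8 * bit);
    }
    return dst;
}

/* Expand one 32-bit word of 1-bit pixels into 32 8-bit pixels by
   pairing a right shift with an expansion, eight bits at a time. */
void expand_bitmap_word(uint32_t bits, uint64_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = dxpnd4_model(bits >> (8 * i));
}
```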
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 6 5 4 3 2 1 0
v v
Execution if(cond) {
if(src2 & 1) 0xFF -> byte0(dst_e)
else 0x00 -> byte0(dst_e)
}
else nop
Delay Slots 1
Example A0 == 0x00000000
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0x00000000
A0 == 0x00000001
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0x000000ff
A0 == 0x00000002
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0x0000ff00
A0 == 0x00000004
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0x00ff0000
A0 == 0x00000008
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0xff000000
A0 == 0x00000005
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0x00ff00ff
A0 == 0x0000000a
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0xff00ff00
A0 == 0x0000000f
DXPND4 .M A0,A1:A0
A1 == 0x00000000
A0 == 0xffffffff
A0 == 0x12345675
DXPND4 .M A0,A1:A0
A1 == 0x00ffffff
A0 == 0x00ff00ff
A0 == 0x89abcdea
DXPND4 .M A0,A1:A0
A1 == 0xffffff00
A0 == 0xff00ff00
4.149 EXT
Extract and Sign-Extend a Bit Field
or
31 29 28 27 23 22 18
3 1 5 5
17 13 12 8 7 6 5 4 3 2 1 0
csta cstb 0 1 0 0 1 0 s p
5 5 1 1
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 1 1 1 1 0 0 0 s p
5 1 1 1
Description The field in src2, specified by csta and cstb, is extracted and sign-extended to 32 bits.
The extract is performed by a shift left followed by a signed shift right. csta and cstb are
the shift left amount and shift right amount, respectively. This can be thought of in
terms of the LSB and MSB of the field to be extracted. Then csta = 31 - MSB of the field
and cstb = csta + LSB of the field. The shift left and shift right amounts may also be
specified as the ten LSBs of the src1 register with cstb being bits 0-4 and csta bits 5-9. In
the example below, csta is 12 and cstb is 11 + 12 = 23. Only the ten LSBs are valid for
the register version of the instruction. If any of the 22 MSBs are non-zero, the result is
invalid.
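The shift-left, signed-shift-right sequence can be sketched in portable C as below. The names ext_model and ext_field are hypothetical. Because right-shifting a negative signed value is implementation-defined in C, the sketch shifts logically and then sign-extends by hand.

```c
#include <stdint.h>

/* Hypothetical model of EXT (constant form). Returns the 32-bit
   result as a bit pattern. */
uint32_t ext_model(uint32_t src2, unsigned csta, unsigned cstb)
{
    csta &= 31;
    cstb &= 31;
    uint32_t t = src2 << csta;            /* shift left by csta        */
    uint32_t r = t >> cstb;               /* logical shift right       */
    uint32_t sign = 1u << (31 - cstb);    /* where the old MSB landed  */
    return (r ^ sign) - sign;             /* sign-extend from that bit */
}

/* Same extraction expressed by the field's MSB and LSB positions,
   using csta = 31 - MSB and cstb = csta + LSB. */
uint32_t ext_field(uint32_t src2, unsigned msb, unsigned lsb)
{
    unsigned csta = 31 - msb;
    return ext_model(src2, csta, csta + lsb);
}
```

For the field shown in the description (csta = 12, cstb = 23), ext_field would be called with msb = 19 and lsb = 11.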
csta cstb - csta
src2 1) X X X X X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
2) 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X 0 0 0 0 0 0 0 0 0 0 0 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst 3) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 1
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A1 07A43F2Ah A1 07A43F2Ah
A2 xxxxxxxxh A2 FFFFF21Fh
Example 2
A1 03B6E7D5h A1 03B6E7D5h
A2 00000073h A2 00000073h
A3 xxxxxxxxh A3 000003B6h
4.150 EXTU
Extract and Zero-Extend a Bit Field
or
31 29 28 27 23 22 18
17 13 12 8 7 6 5 4 3 2 1 0
csta cstb 0 0 0 0 1 0 s p
5 5 1 1
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 0 1 1 1 0 0 0 s p
5 1 1 1
Description The field in src2, specified by csta and cstb, is extracted and zero extended to 32 bits.
The extract is performed by a shift left followed by an unsigned shift right. csta and cstb
are the amounts to shift left and shift right, respectively. This can be thought of in terms
of the LSB and MSB of the field to be extracted. Then csta = 31 - MSB of the field and
cstb = csta + LSB of the field. The shift left and shift right amounts may also be specified
as the ten LSBs of the src1 register with cstb being bits 0-4 and csta bits 5-9. In the
example below, csta is 12 and cstb is 11 + 12 = 23. Only the ten LSBs are valid for the
register version of the instruction. If any of the 22 MSBs are non-zero, the result is
invalid.
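A short C sketch of the zero-extending extract, including how the register form splits the ten LSBs of src1 into csta and cstb. The names extu_model and extu_reg_model are hypothetical. The sketch simply ignores the upper 22 bits of src1, whereas the text above states that the hardware result is invalid when they are non-zero.

```c
#include <stdint.h>

/* Hypothetical model of EXTU (constant form): shift left by csta,
   then logical shift right by cstb. */
uint32_t extu_model(uint32_t src2, unsigned csta, unsigned cstb)
{
    return (src2 << (csta & 31)) >> (cstb & 31);
}

/* Register form: cstb is bits 0-4 of src1, csta is bits 5-9. */
uint32_t extu_reg_model(uint32_t src2, uint32_t src1)
{
    unsigned cstb = src1 & 0x1Fu;
    unsigned csta = (src1 >> 5) & 0x1Fu;
    return extu_model(src2, csta, cstb);
}
```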
csta cstb - csta
src2 1) X X X X X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
2) 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X 0 0 0 0 0 0 0 0 0 0 0 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst 3) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A1 07A43F2Ah A1 07A43F2Ah
A2 xxxxxxxxh A2 0000121Fh
Example 2
A1 03B6E7D5h A1 03B6E7D5h
A2 00000156h A2 00000156h
4.151 FADDDP
Fast Double-Precision Floating Point Add
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5 7 3
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x opfield 1 0 0 0 s p
3 5 5 5 6 4
Description src2 is added to src1. The result is placed in dst. This instruction is the fast version of
ADDDP, with smaller Delay Slots and Functional Unit Latency.
Special Cases
INPUTS
src1     src2     OUTPUT    Config. Reg.
QNaN     QNaN     NaN_out   NAN1,NAN2
QNaN     SNaN     NaN_out   INVAL,NAN1,NAN2
QNaN     other1   NaN_out   NAN1
SNaN     QNaN     NaN_out   INVAL,NAN1,NAN2
SNaN     SNaN     NaN_out   INVAL,NAN1,NAN2
SNaN     other1   NaN_out   INVAL,NAN1
other1   QNaN     NaN_out   NAN2
other1   SNaN     NaN_out   INVAL,NAN2
+INF     -INF     NaN_out   INVAL
+INF     other2   +INF      INFO
-INF     +INF     NaN_out   INVAL
-INF     other3   -INF      INFO
other2   +INF     +INF      INFO
other3   -INF     -INF      INFO
1. Includes +/-INF
2. Includes +INF
3. Includes -INF
Overflow Outputs:
1. Set the INEX bit and the OVER bit in the configuration register.
2. Set the result as follows:
Underflow Outputs:
Zero Outputs:
a. If an add of two equal numbers opposite in sign is performed, the resulting
sign will be (+) unless the rounding is to -inf, in which case it will be (-).
b. If an add of two zeros, (or denormals), with the same sign is performed, the
sign of the output will be the sign of the inputs regardless of rounding mode.
(-0 + -0) -> -0
(+0 + +0) -> +0
Denormal Inputs:
Denormals will be flagged in the configuration register accordingly, and will cause the
result to be inexact unless the other operand is a NaN or Infinity. The denormal input
will be treated as zero throughout the operation.
Delay Slots 2
Example A3 == 0x40240000
A2 == 0x00000000
A5 == 0x40460000
A4 == 0x00000000
FADDDP .L A3:A2,A5:A4,A1:A0 ; 10 + 44 = 54
A1 == 0x404b0000
A0 == 0x00000000
FADCR== 0x00000000
A3 == 0x402fc106
A2 == 0x24dd2f1b
A5 == 0x401070a3
A4 == 0xd70a3d71
FADDDP .L A3:A2,A5:A4,A1:A0
A1 == 0x4033fcac
A0 == 0x083126ea
FADCR= 0x00000080
FADCR== 0x02000000
B3 == 0x402fc106
B2 == 0x24dd2f1b
B5 == 0x401070a3
B4 == 0xd70a3d71
FADDDP .L B3:B2,B5:B4,B1:B0
B1 == 0x4033fcac
B0 == 0x083126e9
FADCR= 0x02800000
FADCR== 0x00000400
A3 == 0x402fc106
A2 == 0x24dd2f1b
A5 == 0x401070a3
A4 == 0xd70a3d71
FADDDP .L A3:A2,A5:A4,A1:A0
A1 == 0x4033fcac
A0 == 0x083126ea
FADCR= 0x00000480
FADCR== 0x06000000
B3 == 0x402fc106
B2 == 0x24dd2f1b
B5 == 0x401070a3
B4 == 0xd70a3d71
FADDDP .L B3:B2,B5:B4,B1:B0
B1 == 0x4033fcac
B0 == 0x083126e9
FADCR= 0x06800000
4.152 FADDSP
Fast Single-Precision Floating Point Add
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5 7 3
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 opfield 1 1 0 s p
3 5 5 5 4 3 3
Description src2 is added to src1. The result is placed in dst. This instruction is the fast version of
ADDSP, with smaller Delay Slots.
Special Cases
INPUTS
src1     src2     OUTPUT    Config. Reg.
QNaN     QNaN     NaN_out   NAN1,NAN2
QNaN     SNaN     NaN_out   INVAL,NAN1,NAN2
QNaN     other1   NaN_out   NAN1
SNaN     QNaN     NaN_out   INVAL,NAN1,NAN2
SNaN     SNaN     NaN_out   INVAL,NAN1,NAN2
SNaN     other1   NaN_out   INVAL,NAN1
other1   QNaN     NaN_out   NAN2
other1   SNaN     NaN_out   INVAL,NAN2
+INF     -INF     NaN_out   INVAL
+INF     other2   +INF      INFO
-INF     +INF     NaN_out   INVAL
-INF     other3   -INF      INFO
other2   +INF     +INF      INFO
other3   -INF     -INF      INFO
1. Includes +/-INF
2. Includes +INF
3. Includes -INF
Overflow Outputs
1. Set the INEX bit and the OVER bit in the configuration register.
2. Set the result as follows:
Underflow Outputs:
Zero outputs:
a. If an add of two equal numbers opposite in sign is performed, the resulting
sign will be (+) unless the rounding is to -inf, in which case it will be (-).
b. If an add of two zeros, (or denormals), with the same sign is performed, the
sign of the output will be the sign of the inputs regardless of rounding mode.
(-0 + -0) -> -0
(+0 + +0) -> +0
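The sign rules for zero results match ordinary IEEE 754 addition, so they can be observed on a host machine. The following C sketch assumes an IEEE 754 host in its default round-to-nearest mode. It does not model the FADCR rounding modes, the status bits, or the treatment of denormal inputs as zero. The helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float and back, via memcpy. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Host-side stand-in for a single-precision add on register bit
   patterns (round-to-nearest only). */
uint32_t faddsp_bits(uint32_t src1, uint32_t src2)
{
    float sum = bits_to_float(src1) + bits_to_float(src2);
    return float_to_bits(sum);
}
```

With this sketch, adding two values that are equal in magnitude and opposite in sign gives +0, and (-0) + (-0) gives -0, as listed above.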
Denormal Inputs:
Denormals will be flagged in the configuration register accordingly, and will cause the
result to be inexact unless the other operand is a NaN or Infinity. The denormal input
will be treated as zero throughout the operation.
Delay Slots 2
Example A2 == 0x41200000
A4 == 0x42300000
FADDSP .L A2,A4,A0 ; 10 + 44 = 54
A0 == 0x42580000
FADCR== 0x00000000
A2 == 0x40784517
A4 == 0x42aeab2d
FADDSP .L A2,A4,A0
A0 == 0x42b66d56
FADCR= 0x00000080
FADCR== 0x02000000
B2 == 0x40784517
B4 == 0x42aeab2d
FADDSP .L B2,B4,B0
B0 == 0x42b66d55
FADCR= 0x02800000
FADCR== 0x00000400
A2 == 0x40784517
A4 == 0x42aeab2d
FADDSP .L A2,A4,A0
A0 == 0x42b66d56
FADCR= 0x00000480
FADCR== 0x06000000
B2 == 0x40784517
B4 == 0x42aeab2d
FADDSP .L B2,B4,B0
B0 == 0x42b66d55
FADCR= 0x06800000
4.153 FMPYDP
Fast Double-Precision Floating Point Multiply
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
3 5 5 5 5
Description src1 is multiplied by src2 and the result is placed in dst. src1, src2 and dst are all double
precision floating point numbers stored in two consecutive registers.
Special Cases:
1. If one source is SNaN or QNaN, the result is a signed NaN_out and the NANn bit
is set. If either source is SNaN, the INVAL bit is set also. The sign of NaN_out is
the XOR of the input signs.
2. Signed infinity multiplied by signed infinity or a normalized number (other than
signed zero) returns signed infinity. Signed infinity multiplied by signed zero (or
denormal) returns a signed NaN_out and sets the INVAL bit.
3. If one or both sources are signed zero, the result is signed zero unless the other
source is a NaN or signed infinity, in which case the result is signed NaN_out.
4. If signed zero is multiplied by signed infinity, the result is signed NaN_out and
the INVAL bit is set.
5. A denormalized source is treated as signed zero and the DENn bit is set. The
INEX bit is set except when the other source is signed infinity, signed NaN, or
signed zero. Therefore, a signed infinity multiplied by a denormalized number
gives a signed NaN_out and sets the INVAL bit.
6. If rounding is performed, the INEX bit is set.
Execution if(cond) src1 x src2 -> dst
else nop
Delay Slots 3
FMCR= 0x00000080
;RMODE(0) (8.6)*[-2.5]=[-21.5] RMODE(0)+INEX(1)
FMCR== 0x00000200
A3 == 0x40213333
A2 == 0x33333333
A5 == 0xc0040000
A4 == 0x00000000
FMPYDP .M1 A3:A2,A5:A4,A1:A0
A1 == 0xc0357fff
A0 == 0xffffffff
FMCR= 0x00000280
;RMODE(1) (8.6)*[-2.5]=(-21.5) RMODE(1)+INEX(1)
FMCR== 0x04000000
B3 == 0x40213333
B2 == 0x33333333
B5 == 0xc0040000
B4 == 0x00000000
FMPYDP .M2 B3:B2,B5:B4,B1:B0
B1 == 0xc0357fff
B0 == 0xffffffff
FMCR= 0x04800000
;RMODE(2) (8.6)*[-2.5]=(-21.5) RMODE(2)+INEX(1)
FMCR== 0x06000000
B3 == 0x40213333
B2 == 0x33333333
B5 == 0xc0040000
B4 == 0x00000000
FMPYDP .M2 B3:B2,B5:B4,B1:B0
B1 == 0xc0358000
B0 == 0x00000000
FMCR= 0x06800000
;RMODE(3) (8.6)*[-2.5]=(-21.5) RMODE(3)+INEX(1)
4.154 FSUBDP
Fast Double-Precision Floating Point Subtract
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
3 5 5 5 4
else nop
Delay Slots 2
FADCR== 0x00000000
A3 == 0x41766489
A2 == 0x78903832
A5 == 0x40130877
A4 == 0x983fffff
FSUBDP . A3:A2,A5:A4,A1:A0
A1 == 0x41766489
A0 == 0x2c6e59d1
23480471.535.. - (4.75826871) = 23480466.776.. RMODE(0)+INEX(1)
FADCR= 0x00000080
4.155 FSUBSP
Fast Single-Precision Floating Point Subtract
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 5 4 3 2 1 0
3 5 5 5 3
Delay Slots 2
FADCR== 0x00000000
A2 == 0x40784517
A4 == 0xc2aeab2d
FSUBSP . A2,A4,A0
A0 == 0x42b66d56
3.8792169094 - (-87.33432769) = 91.2135467529
FADCR= 0x00000080
FADCR== 0x02000000
B2 == 0x40784517
B4 == 0xc2aeab2d
FSUBSP . B2,B4,B0
B0 == 0x42b66d55
3.8792169094 - (-87.33432769) = 91.2135391235
FADCR= 0x02800000
FADCR== 0x00000400
A2 == 0x40784517
A4 == 0xc2aeab2d
FSUBSP . A2,A4,A0
A0 == 0x42b66d56
3.8792169094 - (-87.33432769) = 91.2135467529
FADCR= 0x00000480
FADCR== 0x06000000
B2 == 0x40784517
B4 == 0xc2aeab2d
FSUBSP . B2,B4,B0
B0 == 0x42b66d55
3.8792169094 - (-87.33432769) = 91.2135391235
FADCR= 0x06800000
4.156 GMPY
Galois Field Multiply
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 1 1 1 1 1 0 0 s p
5 1 1 1
Description Performs a Galois field multiply, where src1 is 32 bits and src2 is limited to 9 bits. This
utilizes the existing hardware and produces a 32-bit result. This multiply connects all
levels of the gmpy4 together and only extends out by 8 bits; the resulting data is XORed
down by the 32-bit polynomial.
The polynomial used comes from either the GPLYA or GPLYB control register
depending on which side (A or B) the instruction executes. If the A-side M1 unit is
used, the polynomial comes from GPLYA; if the B-side M2 unit, the polynomial comes
from GPLYB.
uint pp;
uint mask, tpp;
uint i;
pp = 0;
mask = 0x00000100; // multiply by computing
                   // partial products.
for ( i=0; i<8; i++ ){
    if ( src2 & mask ) pp ^= src1;
    mask >>= 1;
    tpp = pp << 1;
    if (pp & 0x80000000) pp = polynomial ^ tpp;
    else pp = tpp;
}
if ( src2 & 0x1 ) pp ^= src1;
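The pseudocode above translates directly into a compilable C function. In this sketch the polynomial is passed as a parameter rather than read from GPLYA or GPLYB, and the name gmpy_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical host-side model of GMPY. src2 bits 8..1 are handled
   in the loop (each followed by a multiply-by-x step with reduction
   by the polynomial), and bit 0 is folded in afterwards. */
uint32_t gmpy_model(uint32_t src1, uint32_t src2, uint32_t polynomial)
{
    uint32_t pp = 0;
    uint32_t mask = 0x00000100u;

    for (int i = 0; i < 8; i++) {
        if (src2 & mask)
            pp ^= src1;
        mask >>= 1;
        uint32_t tpp = pp << 1;
        pp = (pp & 0x80000000u) ? (polynomial ^ tpp) : tpp;
    }
    if (src2 & 0x1u)
        pp ^= src1;
    return pp;
}
```

Two properties follow from the loop structure and are convenient sanity checks: multiplying by 1 returns src1 unchanged, and multiplying 0x80000000 by 2 returns the polynomial itself, because the shifted-out top bit triggers exactly one reduction.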
Delay Slots 3
A0 12345678h A2 C721A0EFh
A1 00000126h
4.157 GMPY4
Galois Field Multiply, Packed 8-Bit
Opcode
31 29 28 27 23 22 18
3 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 0 1 1 1 0 0 s p
5 1 1 1
Description Performs the Galois field multiply on four values in src1 with four parallel values in
src2. The four products are packed into dst. The values in both src1 and src2 are treated
as unsigned, 8-bit packed data.
For each pair of 8-bit quantities in src1 and src2, the unsigned, 8-bit value from src1 is
Galois field multiplied (gmpy) with the unsigned, 8-bit value from src2. The product of
src1 byte 0 and src2 byte 0 is written to byte0 of dst. The product of src1 byte 1 and src2
byte 1 is written to byte1 of dst. The product of src1 byte 2 and src2 byte 2 is written to
byte2 of dst. The product of src1 byte 3 and src2 byte 3 is written to the most-significant
byte in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
GMPY4
= = = =
31 0
ua_3 gmpy ub_3 ua_2 gmpy ub_2 ua_1 gmpy ub_1 ua_0 gmpy ub_0 ←dst
The size and polynomial are controlled by the Galois field polynomial generator
function register (GFPGFR). All registers in the control register file can be written
using the MVC instruction (see MVC).
The default field generator polynomial is 1Dh, and the default size is 7. This setting is
used for many communications standards.
is equivalent to:
Execution if (cond) {
}
else nop
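The byte-wise operation can be sketched in C for the default GFPGFR setting only. The sketch assumes that polynomial 1Dh with size 7 denotes the field GF(2^8) with reduction polynomial 11Dh (the x^8 term being implicit); other GFPGFR settings are not modeled. The names gf256_mul and gmpy4_model are hypothetical.

```c
#include <stdint.h>

/* Shift-and-add multiply in GF(2^8). poly_low holds the low eight
   bits of the reduction polynomial; the x^8 term is implicit. */
uint8_t gf256_mul(uint8_t a, uint8_t b, uint8_t poly_low)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        b >>= 1;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry)
            a ^= poly_low;          /* reduce when x^8 appears */
    }
    return p;
}

/* Hypothetical model of GMPY4 with the default polynomial 1Dh:
   four independent byte products packed into one word. */
uint32_t gmpy4_model(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t a = (uint8_t)(src1 >> (8 * lane));
        uint8_t b = (uint8_t)(src2 >> (8 * lane));
        dst |= (uint32_t)gf256_mul(a, b, 0x1D) << (8 * lane);
    }
    return dst;
}
```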
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
A5 45 23 00 01h 69 35 0 1 A5 45 23 00 01h
unsigned
A6 57 34 00 01h 87 52 0 1 A6 57 34 00 01h
unsigned
Example 2
4.158 IDLE
Multicycle NOP With No Termination Until Interrupt
Syntax IDLE
unit = none
Opcode
31 18 17 16
Reserved 0 1
14
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description Performs an infinite multicycle NOP that terminates upon servicing an interrupt, or when a
branch occurs because the IDLE instruction is in the delay slots of a branch.
The IDLE instruction cannot be paired with any other multicycle NOP instruction in
the same execute packet. Instructions that generate a multicycle NOP are: ADDKPC,
BNOP, and the multicycle NOP.
Delay Slots 0
4.159 INTDP
Convert Signed Integer to Double-Precision Floating-Point Value
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 1 1 1 0 0 1 1 1 0 s p
1 1 1
Description The signed integer value in src2 is converted to a double-precision value and placed in
dst.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read src2
Written dst_l dst_h
Unit in use .L
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Instruction Type INTDP
Delay Slots 4
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 41B9 6511h 2700 0000h
4.2605393 E08
4.160 INTDPU
Convert Unsigned Integer to Double-Precision Floating-Point Value
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 1 1 1 0 1 1 1 1 0 s p
1 1 1
Description The unsigned integer value in src2 is converted to a double-precision value and placed
in dst.
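The difference between the signed and unsigned conversions shows up only when bit 31 of the source is set. The following C sketch contrasts the two on an IEEE 754 host. The function names are hypothetical, and the 64-bit return value stands for the dst register pair (upper word in the odd register). Every 32-bit integer is exactly representable in double precision, so no rounding is involved.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of a double, for comparison with register dumps. */
uint64_t double_to_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Signed interpretation of the 32-bit source word. The cast to
   int32_t assumes two's complement wrap for values above INT32_MAX. */
uint64_t intdp_bits(uint32_t src2)
{
    return double_to_bits((double)(int32_t)src2);
}

/* Unsigned interpretation of the same word. */
uint64_t intdpu_bits(uint32_t src2)
{
    return double_to_bits((double)src2);
}
```

For the word FFFFFFDEh, the signed form yields -34.0 while the unsigned form yields 4294967262.0.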
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read src2
Written dst_l dst_h
Unit in use .L
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Instruction Type INTDP
Delay Slots 4
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 41EF FFFFh FBC0 0000h
4.2949673 E09
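As a host-side cross-check of the result pair above, the unsigned conversion can be modeled in Python. This is a sketch under assumptions: `intdpu` is a hypothetical helper name, the result is taken to be an IEEE 754 double with the upper word in the odd register, and the input FFFF FFDEh is chosen because it equals 4294967262, which rounds to the 4.2949673 E09 shown.

```python
import struct

def intdpu(src2):
    """Model INTDPU: unsigned 32-bit integer -> (odd, even) words of a double."""
    value = src2 & 0xFFFFFFFF  # no sign reinterpretation for the unsigned form
    bits = struct.unpack(">Q", struct.pack(">d", float(value)))[0]
    return bits >> 32, bits & 0xFFFFFFFF

odd, even = intdpu(0xFFFFFFDE)
print(f"{odd:08X} {even:08X}")  # 41EFFFFF FBC00000
```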
4.161 INTSP
Convert Signed Integer to Single-Precision Floating Point
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Description Converts the signed integer in src2 to a single-precision floating-point value. The
INEX bit is set if the mantissa was rounded.
Delay Slots 3
Functional Unit Latency 1
See Also DPINT, DPTRUNC, INTDP, INTDPU, INTSP, SPINT, SPTRUNC
Example FADCR== 0x00000000
A2 == 0x19651127
INTSP .L A2,A0
A0 == 0x4dcb2889
FADCR= 0x00000080
FADCR== 0x02000000
B2 == 0x19651127
INTSP .L B2,B0
B0 == 0x4dcb2889
FADCR= 0x02800000
FADCR== 0x00000400
A2 == 0x19651127
INTSP .L A2,A0
A0 == 0x4dcb288a
FADCR= 0x00000480
FADCR== 0x06000000
B2 == 0x19651127
INTSP .L B2,B0
B0 == 0x4dcb2889
FADCR= 0x06800000
FADCR== 0x00000000
A2 == 0xffffffde
INTSP .L A2,A0
A0 == 0xc2080000
FADCR== 0x00000000
FADCR== 0x00000000
A2 == 0xffffffff
INTSP .L A2,A0
A0 == 0xbf800000
FADCR== 0x00000080
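The effect of the rounding mode on the examples above can be reproduced with a small host-side model. This Python sketch is illustrative only: `intsp` and the mode labels `"nearest"`, `"zero"`, `"up"`, and `"down"` are names invented here, and the mapping from FADCR bit patterns to these modes is not modeled. The second return value corresponds to the condition under which the description says INEX is set.

```python
def intsp(value, mode="nearest"):
    """Model INTSP: signed integer -> (binary32 bit pattern, inexact flag)."""
    if value == 0:
        return 0, False
    sign = 0x80000000 if value < 0 else 0
    mag = abs(value)
    exp = mag.bit_length() - 1   # position of the leading 1
    shift = exp - 23             # low bits that do not fit in 24 bits
    if shift <= 0:
        sig, inexact = mag << -shift, False
    else:
        sig, rem = mag >> shift, mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        inexact = rem != 0
        round_up = {
            "nearest": rem > half or (rem == half and (sig & 1) == 1),
            "zero": False,
            "up": inexact and sign == 0,    # toward +infinity
            "down": inexact and sign != 0,  # toward -infinity
        }[mode]
        if round_up:
            sig += 1
            if sig == 1 << 24:   # carry out of the significand
                sig, exp = sig >> 1, exp + 1
    return sign | ((exp + 127) << 23) | (sig & 0x7FFFFF), inexact

print(hex(intsp(0x19651127)[0]))        # 0x4dcb2889
print(hex(intsp(0x19651127, "up")[0]))  # 0x4dcb288a
```

Exact conversions such as -34 (C208 0000h) return an inexact flag of False in this model.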
4.162 INTSPU
Convert Unsigned Integer to Single-Precision Floating Point
3 5 5 5 7
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5
Description Converts the unsigned integer in src2 to a single-precision floating-point value. The
INEX bit is set if the mantissa was rounded.
Execution if(cond) sp(uint(src2)) -> dst
else nop
Delay Slots 3
Example FADCR== 0x00000000
A2 == 0x19651127
INTSPU .L A2,A0
A0 == 0x4dcb2889
FADCR= 0x00000080
FADCR== 0x02000000
B2 == 0x19651127
INTSPU .L B2,B0
B0 == 0x4dcb2889
FADCR= 0x02800000
FADCR== 0x00000400
A2 == 0x19651127
INTSPU .L A2,A0
A0 == 0x4dcb288a
FADCR= 0x00000480
FADCR== 0x06000000
B2 == 0x19651127
INTSPU .L B2,B0
B0 == 0x4dcb2889
FADCR= 0x06800000
FADCR== 0x00000000
A2 == 0xffffffde
INTSPU .L A2,A0
A0 == 0x4f800000
FADCR= 0x00000080
FADCR== 0x00000000
A2 == 0xffffffff
INTSPU .L A2,A0
A0 == 0x4f800000
FADCR== 0x00000080
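For the default round-to-nearest case, the unsigned results above can be checked with Python's `struct` module. This is a sketch, not TI code: `intspu` is a hypothetical helper, it models only round-to-nearest, and the inexact flag is computed by converting the result back to an integer and comparing.

```python
import struct

def intspu(src2):
    """Model INTSPU (round to nearest): returns (binary32 bits, inexact)."""
    value = src2 & 0xFFFFFFFF
    # Packing as 'f' rounds the exact double value to the nearest binary32.
    bits = struct.unpack(">I", struct.pack(">f", float(value)))[0]
    back = int(struct.unpack(">f", struct.pack(">I", bits))[0])
    return bits, back != value

print(hex(intspu(0xFFFFFFDE)[0]))  # 0x4f800000
```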
4.163 LAND
Logical AND
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
Description The LAND instruction performs a logical AND of two source registers. If both
operands are nonzero, the result is one; otherwise, the result is zero. The result is stored
in the destination register.
Execution if (cond) {
if ((src1 != 0) && (src2 != 0)) dst = 1
else dst = 0
}
Delay Slots 0
A0 == 0x00000000
A0 == 0x00005678
LAND .L A0,A0,A15
A15 == 0x00000000
4.164 LANDN
Logical AND, One Operand Negated
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
Description The LANDN instruction performs logical AND between a source register and the NOT
of a second register.
Execution if (cond) {
if ((src1 != 0) && (src2 == 0)) dst = 1
else dst = 0
}
Delay Slots 0
Example A0 == 0x12340000
A0 == 0x00005678
LANDN .L A0,A0,A15
A15 == 0x00000000
A0 == 0x12340000
A0 == 0x00000000
LANDN .L A0,A0,A15
A15 == 0x00000001
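The two examples above suggest that the second source operand is the one that is negated. Under that assumption, which is inferred from the example values rather than stated in the description, a minimal Python model looks like this (`land` and `landn` are hypothetical helper names):

```python
def land(src1, src2):
    """Logical AND: 1 only if both operands are nonzero."""
    return int(src1 != 0 and src2 != 0)

def landn(src1, src2):
    """Logical AND with the second operand negated (assumed from the examples)."""
    return int(src1 != 0 and src2 == 0)

print(landn(0x12340000, 0x00005678))  # 0
print(landn(0x12340000, 0x00000000))  # 1
```

Note that these are logical, not bitwise, operations: the result is always 0 or 1 regardless of which bits are set in the operands.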
4.165 LDB(U)
Load Byte From Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Opcode
31 29 28 27 23 22 18
creg z dst baseR
3 1 5 5
17 13 12 9 8 7 6 4 3 2 1 0
offsetR/ucst5 mode 0 y op 0 1 s p
5 4 1 3 1 1
Description Loads a byte from memory to a general-purpose register (dst). Table 4-6 summarizes
the data types supported by loads. Table 3-12 on page 3-30 describes the addressing
generator options. The memory address is formed from a base address register (baseR)
and an optional offset that is either a register (offsetR) or a 5-bit unsigned constant
(ucst5). If an offset is not given, the assembler assigns an offset of zero.
Table 4-6 Data Types Supported by LDB(U) Instruction
Mnemonic   op Field   Load Data Type       Size (bits)   Left Shift of Offset
LDB        0 1 0      Load byte            8             0 bits
LDBU       0 0 1      Load byte unsigned   8             0 bits
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
For LDB(U), the byte is loaded into the 8 LSBs of dst. For LDB, the upper 24 bits of
dst are sign-extended; for LDBU, the upper 24 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file. The r bit should
be cleared to 0.
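The difference between sign extension and zero fill can be illustrated with a short Python model. This is a sketch only; `extend` is a hypothetical helper, and the same function covers the halfword loads by passing a width of 16.

```python
def extend(value, bits, signed):
    """Place the low `bits` of value in a 32-bit register image.

    signed=True models LDB/LDH (sign extension);
    signed=False models LDBU/LDHU (zero fill).
    """
    mask = (1 << bits) - 1
    value &= mask
    if signed and (value >> (bits - 1)) & 1:
        value |= 0xFFFFFFFF & ~mask  # replicate the sign bit upward
    return value

print(hex(extend(0x80, 8, True)))   # 0xffffff80
print(hex(extend(0x80, 8, False)))  # 0x80
```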
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
For more information on delay slots for a load, see Chapter 5 ‘‘Pipeline’’ on page 5-1.
Examples Example 1
Example 2
mem 4000h 0112 2334h mem 4000h 0112 2334h mem 4000h 0112 2334h
mem 4004h 4556 6778h mem 4004h 4556 6778h mem 4004h 4556 6778h
Example 3
LDB .D1 *A4++[5],A8
mem 4000h 0112 2334h mem 4000h 0112 2334h mem 4000h 0112 2334h
mem 4004h 4556 6778h mem 4004h 4556 6778h mem 4004h 4556 6778h
Example 4
mem 4000h 0112 2334h mem 4000h 0112 2334h mem 4000h 0112 2334h
mem 4004h 4556 6778h mem 4004h 4556 6778h mem 4004h 4556 6778h
4.166 LDB(U)
Load Byte From Memory With a 15-Bit Unsigned Constant Offset
or
unit = .D2
Opcode
31 29 28 27 23
creg z dst
3 1 5
22 8 7 6 4 3 2 1 0
ucst15 y op 1 1 s p
15 1 3 1 1
Description Loads a byte from memory to a general-purpose register (dst). Table 4-7 summarizes
the data types supported by loads. The memory address is formed from a base address
register B14 (y = 0) or B15 (y = 1) and an offset, which is a 15-bit unsigned constant
(ucst15). The assembler selects this format only when the constant is larger than five
bits in magnitude. This instruction operates only on the .D2 unit.
The offset, ucst15, is scaled by a left shift of 0 bits. After scaling, ucst15 is added to baseR.
Subtraction is not supported. The result of the calculation is the address sent to
memory. The addressing arithmetic is always performed in linear mode.
For LDB(U), the byte is loaded into the 8 LSBs of dst. For LDB, the upper 24 bits of
dst are sign-extended; for LDBU, the upper 24 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read B14/B15
Written dst
Unit in use .D2
Delay Slots 4
B1 xxxxxxxxh B1 xxxxxxxxh
B1 0000 0012h
B14 00000100h
4.167 LDDW
Load Doubleword From Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 1 y 1 1 0 0 1 s p
5 4 1 1 1
Description Loads a 64-bit quantity from memory into a register pair dst_o:dst_e. Table 3-12 on
page 3-30 describes the addressing generator options. The memory address is formed
from a base address register (baseR) and an optional offset that is either a register
(offsetR) or a 5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file and on the same side as the .D
unit used. The y bit in the opcode determines the .D unit and the register file used: y =
0 selects the .D1 unit and the baseR and offsetR from the A register file, and y = 1 selects
the .D2 unit and baseR and offsetR from the B register file. The s bit determines the
register file into which the dst is loaded: s = 0 indicates that dst is in the A register file,
and s = 1 indicates that dst is in the B register file. The r bit has a value of 1 for the
LDDW instruction. The dst field must always be an even value because the LDDW
instruction loads register pairs. Therefore, bit 23 is always zero.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
The destination register pair must consist of a consecutive even and odd register pair
from the same register file. The instruction can be used to load a double-precision
floating-point value (64 bits), a pair of single-precision floating-point words (32 bits),
or a pair of 32-bit integers. The 32 least-significant bits are loaded into the
even-numbered register and the 32 most-significant bits (containing the sign bit and
exponent) are loaded into the next register (which is always an odd-numbered register).
The register pair syntax places the odd register first, followed by a colon, then the even
register (that is, A1:A0, B1:B0, A3:A2, B3:B2, etc.).
All 64 bits of the double-precision floating point value are stored in big- or little-endian
byte order, depending on the mode selected. When the LDDW instruction is used to
load two 32-bit single-precision floating-point values or two 32-bit integer values, the
order is dependent on the endian mode used. In little-endian mode, the first 32-bit
word in memory is loaded into the even register. In big-endian mode, the first 32-bit
word in memory is loaded into the odd register. Regardless of the endian mode, the
doubleword address must be on a doubleword boundary (the three LSBs are zero).
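The endian-dependent register assignment described above can be sketched in Python. This is a host-side illustration, not TI code: `lddw` is a hypothetical helper, memory is modeled as a `bytes` object indexed from address 0, and the function returns the pair in odd:even order to match the register pair syntax.

```python
import struct

def lddw(memory, address, little_endian=True):
    """Model LDDW: return (odd, even) register values for an aligned load."""
    if address & 0x7:
        raise ValueError("LDDW address must be on a doubleword boundary")
    raw = bytes(memory[address:address + 8])
    first, second = struct.unpack("<II" if little_endian else ">II", raw)
    if little_endian:
        even, odd = first, second   # first word in memory -> even register
    else:
        odd, even = first, second   # first word in memory -> odd register
    return odd, even

mem = bytes(range(16))
odd, even = lddw(mem, 8)
print(f"{odd:08X}:{even:08X}")  # 0F0E0D0C:0B0A0908
```

In both modes the odd register ends up holding the 32 most-significant bits of the 64-bit value, which is why a double-precision value loads correctly regardless of the endian mode.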
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
Delay Slots 4
Examples Example 1
Example 2
Example 3
mem 40B0h 0112 2334h 4556 6778h mem 40B0h 0112 2334h 4556 6778h
A4 0000 40B0h
Example 4
mem 40E0h 0112 2334h 4556 6778h mem 40E0h 0112 2334h 4556 6778h
A4 0000 40E0h
Example 5
mem 40C0h 4556 6778h 899A ABBCh mem 40C0h 4556 6778h 899A ABBCh
A4 0000 40C0h
4.168 LDH(U)
Load Halfword From Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Opcode
31 29 28 27 23 22 18
creg z dst baseR
3 1 5 5
17 13 12 9 8 7 6 4 3 2 1 0
offsetR/ucst5 mode 0 y op 0 1 s p
5 4 1 3 1 1
Description Loads a halfword from memory to a general-purpose register (dst). Table 4-8
summarizes the data types supported by halfword loads. Table 3-12 on page 3-30
describes the addressing generator options. The memory address is formed from a base
address register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit
unsigned constant (ucst5). If an offset is not given, the assembler assigns an offset of
zero.
Table 4-8 Data Types Supported by LDH(U) Instruction
Mnemonic   op Field   Load Data Type           Size (bits)   Left Shift of Offset
LDH        1 0 0      Load halfword            16            1 bit
LDHU       0 0 0      Load halfword unsigned   16            1 bit
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
For LDH(U), the values are loaded into the 16 LSBs of dst. For LDH, the upper 16 bits
of dst are sign-extended; for LDHU, the upper 16 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file. The r bit should
be cleared to 0.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
For more information on delay slots for a load, see Chapter 5 ‘‘Pipeline’’ on page 5-1.
4.169 LDH(U)
Load Halfword From Memory With a 15-Bit Unsigned Constant Offset
or
unit = .D2
Opcode
31 29 28 27 23
creg z dst
3 1 5
22 8 7 6 4 3 2 1 0
ucst15 y op 1 1 s p
15 1 3 1 1
Description Loads a halfword from memory to a general-purpose register (dst). Table 4-9
summarizes the data types supported by loads. The memory address is formed from a
base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a 15-bit unsigned
constant (ucst15). The assembler selects this format only when the constant is larger
than five bits in magnitude. This instruction operates only on the .D2 unit.
The offset, ucst15, is scaled by a left shift of 1 bit. After scaling, ucst15 is added to baseR.
Subtraction is not supported. The result of the calculation is the address sent to
memory. The addressing arithmetic is always performed in linear mode.
For LDH(U), the values are loaded into the 16 LSBs of dst. For LDH, the upper 16 bits
of dst are sign-extended; for LDHU, the upper 16 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read B14/B15
Written dst
Unit in use .D2
Delay Slots 4
4.170 LDNDW
Load Nonaligned Doubleword From Memory With Constant or Register Offset
Opcode
31 29 28 27 24 23 22 18
17 13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 1 y 0 1 0 0 1 s p
5 4 1 1 1
Description Loads a 64-bit quantity from memory into a register pair, dst_o:dst_e. Table 3-12 on
page 3-30 describes the addressing generator options. The LDNDW instruction may
read a 64-bit value from any byte boundary. Thus alignment to a 64-bit boundary is not
required. The memory address is formed from a base address register (baseR) and an
optional offset that is either a register (offsetR) or a 5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file, and on the same side, as the .D
unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The LDNDW instruction supports both scaled offsets and nonscaled offsets. The sc
field is used to indicate whether the offsetR/ucst5 is scaled or not. If sc is 1 (scaled), the
offsetR/ucst5 is shifted left 3 bits before adding or subtracting from the baseR. If sc is 0
(nonscaled), the offsetR/ucst5 is not shifted before adding or subtracting from the
baseR. For the preincrement, predecrement, positive offset, and negative offset address
generator options, the result of the calculation is the address to be accessed in memory.
For postincrement or postdecrement addressing, the value of baseR before the addition
or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
The dst field of the instruction selects a register pair, a consecutive even-numbered and
odd-numbered register pair from the same register file. The instruction can be used to
load a pair of 32-bit integers. The 32 least-significant bits are loaded into the
even-numbered register and the 32 most-significant bits are loaded into the next
register (that is always an odd-numbered register).
The dst can be in either register file, regardless of the .D unit or baseR or offsetR used.
The s bit determines which file dst will be loaded into: s = 0 indicates dst will be in the
A register file and s = 1 indicates dst will be loaded in the B register file. The r bit has a
value of 1 for the LDNDW instruction.
Parentheses, ( ), can be used to tell the assembler that the offset is a non-scaled offset.
For example, LDNDW (.unit) *+baseR (14), dst represents an offset of 14 bytes, and the
assembler writes out the instruction with offsetC = 14 and sc = 0.
LDNDW (.unit) *+baseR [16], dst represents an offset of 16 doublewords, or 128 bytes,
and the assembler writes out the instruction with offsetC = 16 and sc = 1.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
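The bracket and parenthesis forms above can be summarized with a small Python sketch. It is illustrative only: `encode_ldndw_offset` is a hypothetical helper that takes the offset operand as written in the source and returns the encoded constant, the sc bit, and the resulting byte offset. It handles constant offsets only.

```python
def encode_ldndw_offset(operand):
    """'(14)' -> (14, 0, 14); '[16]' -> (16, 1, 128)."""
    operand = operand.strip()
    if operand.startswith("(") and operand.endswith(")"):
        sc = 0   # parentheses: nonscaled, offset is in bytes
    elif operand.startswith("[") and operand.endswith("]"):
        sc = 1   # brackets: scaled, offset is in doublewords
    else:
        raise ValueError("offset must be wrapped in () or []")
    offset = int(operand[1:-1], 0)
    if not 0 <= offset < 32:
        raise ValueError("constant offset must fit in 5 unsigned bits")
    return offset, sc, offset << (3 * sc)

print(encode_ldndw_offset("(14)"))  # (14, 0, 14)
print(encode_ldndw_offset("[16]"))  # (16, 1, 128)
```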
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR,offsetR
Written baseR dst
Unit in use .D
Examples Example 1
A0 0000 1009h
Byte
Memory
Address 100C 100B 100A 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value 11 05 69 34 5E 1C 4F 29 A8 12 B6 C5 D4
Example 2
A0 0000 100Bh
Byte
Memory
Address 100C 100B 100A 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value 11 05 69 34 5E 1C 4F 29 A8 12 B6 C5 D4
4.171 LDNW
Load Nonaligned Word From Memory With Constant or Register Offset
Opcode
31 29 28 27 23 22 18
17 13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 1 y 0 1 1 0 1 s p
5 4 1 1 1
Description Loads a 32-bit quantity from memory into a 32-bit register, dst. Table 3-12 on
page 3-30 describes the addressing generator options. The LDNW instruction may
read a 32-bit value from any byte boundary. Thus alignment to a 32-bit boundary is not
required. The memory address is formed from a base address register (baseR), and an
optional offset that is either a register (offsetR) or a 5-bit unsigned constant (ucst5). If
an offset is not given, the assembler assigns an offset of zero.
Both offsetR and baseR must be in the same register file, and on the same side, as the .D
unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The offsetR/ucst5 is scaled by a left shift of 2 bits. After scaling, offsetR/ucst5 is added to,
or subtracted from, baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed from memory.
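The address-generation rules in this paragraph can be expressed as a short Python model. This is a sketch under assumptions: `ldnw_address` and the mode labels are names invented for illustration, and only linear addressing is modeled (circular mode is not).

```python
def ldnw_address(mode, base, offset):
    """Return (address accessed, baseR value after the instruction)."""
    scaled = offset << 2   # word offsets are scaled by a left shift of 2
    negative = mode in ("neg_offset", "predecrement", "postdecrement")
    target = (base - scaled if negative else base + scaled) & 0xFFFFFFFF
    if mode in ("pos_offset", "neg_offset"):
        return target, base      # baseR is left unchanged
    if mode in ("preincrement", "predecrement"):
        return target, target    # baseR updated, new value used
    if mode in ("postincrement", "postdecrement"):
        return base, target      # old baseR used, then updated
    raise ValueError(f"unknown mode: {mode}")

print([hex(v) for v in ldnw_address("preincrement", 0x100, 1)])   # ['0x104', '0x104']
print([hex(v) for v in ldnw_address("postincrement", 0x100, 1)])  # ['0x100', '0x104']
```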
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
The dst can be in either register file, regardless of the .D unit or baseR or offsetR used.
The s bit determines which file dst will be loaded into: s = 0 indicates dst will be in the
A register file and s = 1 indicates dst will be loaded in the B register file. The r bit has a
value of 1 for the LDNW instruction.
Parentheses, ( ), can be used to tell the assembler that the offset is a nonscaled, constant
offset. The assembler right shifts the constant by 2 bits for word loads before using it
for the ucst5 field. After scaling by the LDNW instruction, this results in the same
constant offset as the assembler source if the least-significant two bits are zeros.
For example, LDNW (.unit) *+baseR (12), dst represents an offset of 12 bytes (3 words),
and the assembler writes out the instruction with ucst5 = 3.
LDNW (.unit) *+baseR [12], dst represents an offset of 12 words, or 48 bytes, and the
assembler writes out the instruction with ucst5 = 12.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
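The relationship between the source operand and the encoded ucst5 field can be checked with a few lines of Python. This is a sketch with a hypothetical helper name, `ldnw_ucst5`, covering constant offsets only; it returns the encoded field and the byte offset the hardware applies after its left shift of 2.

```python
def ldnw_ucst5(operand):
    """'(12)' -> (3, 12); '[12]' -> (12, 48)."""
    operand = operand.strip()
    constant = int(operand[1:-1], 0)
    if operand[0] == "(" and operand[-1] == ")":
        field = constant >> 2   # assembler right-shifts a byte offset
    elif operand[0] == "[" and operand[-1] == "]":
        field = constant        # already a word count
    else:
        raise ValueError("offset must be wrapped in () or []")
    return field, field << 2

print(ldnw_ucst5("(12)"))  # (3, 12)
print(ldnw_ucst5("[12]"))  # (12, 48)
```

With a parenthesized constant whose two least-significant bits are not zero, the byte offset returned here differs from the source constant, which is the caveat the text describes.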
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR,offsetR
Written baseR dst
Unit in use .D
Examples Example 1
mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h
mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value 1C 4F 29 A8 12 B6 C5 D4
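Using the byte table above, a nonaligned word load can be modeled in Python. This sketch assumes little-endian byte order, which is consistent with the word values shown for 1000h and 1004h; `ldnw` is a hypothetical helper and the nonaligned address 1001h is chosen here purely for illustration.

```python
# Byte memory from the table above (address -> data value).
MEMORY = {
    0x1000: 0xD4, 0x1001: 0xC5, 0x1002: 0xB6, 0x1003: 0x12,
    0x1004: 0xA8, 0x1005: 0x29, 0x1006: 0x4F, 0x1007: 0x1C,
}

def ldnw(memory, address):
    """Assemble a 32-bit word from four consecutive bytes, lowest address first."""
    return sum(memory[address + i] << (8 * i) for i in range(4))

print(f"{ldnw(MEMORY, 0x1000):08X}")  # 12B6C5D4
print(f"{ldnw(MEMORY, 0x1001):08X}")  # A812B6C5
```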
Example 2
mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h
mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value 1C 4F 29 A8 12 B6 C5 D4
4.172 LDW
Load Word From Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Opcode
31 29 28 27 23 22 18
17 13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 0 y 1 1 0 0 1 s p
5 4 1 1 1
Description Loads a word from memory to a general-purpose register (dst). Table 3-12 on
page 3-30 describes the addressing generator options. The memory address is formed
from a base address register (baseR) and an optional offset that is either a register
(offsetR) or a 5-bit unsigned constant (ucst5). If an offset is not given, the assembler
assigns an offset of zero.
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
For LDW, the loaded word fills all 32 bits of dst. dst can be in either register file, regardless of the
.D unit or baseR or offsetR used. The s bit determines which file dst will be loaded into:
s = 0 indicates dst will be loaded in the A register file and s = 1 indicates dst will be
loaded in the B register file. The r bit should be cleared to 0.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
For more information on delay slots for a load, see Chapter 5 ‘‘Pipeline’’ on page 5-1.
Examples Example 1
mem 100h 21F3 1996h mem 100h 21F3 1996h mem 100h 21F3 1996h
Example 2
mem 100h 0798 F25Ah mem 100h 0798 F25Ah mem 100h 0798 F25Ah
mem 104h 1970 19F3h mem 104h 1970 19F3h mem 104h 1970 19F3h
Example 3
LDW .D1 *++A4[1],A6
mem 104h 0217 6991h mem 104h 0217 6991h mem 104h 0217 6991h
Example 4
mem 40C8h DCCB BAA8h mem 40C8h DCCB BAA8h mem 40C8h DCCB BAA8h
Example 5
mem 40B8h 9AAB BCCDh mem 40B8h 9AAB BCCDh mem 40B8h 9AAB BCCDh
4.173 LDW
Load Word From Memory With a 15-Bit Unsigned Constant Offset
unit = .D2
Opcode
31 29 28 27 23
creg z dst
3 1 5
22 8 7 6 5 4 3 2 1 0
ucst15 y 1 1 0 1 1 s p
15 1 1 1
Description Loads a word from memory to a general-purpose register (dst). The memory address is
formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a
15-bit unsigned constant (ucst15). The assembler selects this format only when the
constant is larger than five bits in magnitude. This instruction operates only on the .D2
unit.
The offset, ucst15, is scaled by a left shift of 2 bits. After scaling, ucst15 is added to baseR.
Subtraction is not supported. The result of the calculation is the address sent to
memory. The addressing arithmetic is always performed in linear mode.
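The address calculation for the 15-bit-offset load forms can be sketched in Python. This is a model under assumptions: `ucst15_address` is a hypothetical helper, the shift amounts are the ones the 15-bit-offset sections state for byte (0), halfword (1), and word (2) loads, and `base` stands for the current value of B14 or B15.

```python
SHIFT = {"LDB": 0, "LDBU": 0, "LDH": 1, "LDHU": 1, "LDW": 2}

def ucst15_address(mnemonic, base, ucst15):
    """Address sent to memory: base + (ucst15 scaled by the access size)."""
    if not 0 <= ucst15 < (1 << 15):
        raise ValueError("ucst15 must fit in 15 unsigned bits")
    # Only addition is supported, and the arithmetic is always linear.
    return (base + (ucst15 << SHIFT[mnemonic])) & 0xFFFFFFFF

print(hex(ucst15_address("LDW", 0x100, 4)))  # 0x110
print(hex(ucst15_address("LDH", 0x100, 4)))  # 0x108
```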
For LDW, the loaded word fills all 32 bits of dst. dst can be in either register file. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read B14/B15
Written dst
Unit in use .D2
Delay Slots 4
4.174 LMBD
Leftmost Bit Detection
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 5 4 3 2 1 0
src1/cst5 x op 1 1 0 s p
5 1 7 1 1
Description The LSB of the src1 operand determines whether to search for a leftmost 1 or 0 in src2.
The number of bits to the left of the first 1 or 0 when searching for a 1 or 0, respectively,
is placed in dst.
The following diagram illustrates the operation of LMBD for several cases.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 1 x x x x x x x x x x x x x x
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
x x x x x x x x x x x x x x x x
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 0 1 x x x x x x x x x x x
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
x x x x x x x x x x x x x x x x
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Execution if (cond) {
if (src1[0] == 0), lmb0(src2) → dst
if (src1[0] == 1), lmb1(src2) → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
A1 00000001h A1 00000001h
A2 009E3A81h A2 009E3A81h
A3 xxxxxxxxh A3 00000008h
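The example above can be reproduced with a short Python model of the search. This is a sketch only: `lmbd` is a hypothetical helper, and the return value of 32 when the searched-for bit does not occur in src2 is an assumption of this model rather than something stated in the text above.

```python
def lmbd(src1, src2):
    """Count the bits to the left of the first bit in src2 that equals src1's LSB."""
    want = src1 & 1
    for count in range(32):
        if ((src2 >> (31 - count)) & 1) == want:
            return count
    return 32  # assumed result when no matching bit is found

print(lmbd(0x00000001, 0x009E3A81))  # 8
```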
4.175 LOR
Logical OR
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
Description The LOR instruction performs logical OR between two source registers. If either of the
operands has a nonzero value, the result has the value 1. Otherwise, the result has the
value 0.
Execution if (cond) {
if ((src1 != 0) || (src2 != 0)) dst = 1
else dst = 0
}
Delay Slots 0
Example A0 == 0x12340000
A0 == 0x00005678
LOR .L A0,A0,A15
A15 == 0x00000001
A0 == 0x00000000
A0 == 0x00000000
LOR .L A0,A0,A15
A15 == 0x00000000
4.176 MAX2
Maximum, Signed, Packed 16-Bit
Opcode .L unit
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 0 0 1 0 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 1 0 1 1 1 0 0 s p
5 1 1 1
Description Performs a maximum operation on signed, packed 16-bit values. For each pair of
signed 16-bit values in src1 and src2, MAX2 places the larger value in the corresponding
position in dst.
31 16 15 0
a_hi a_lo ←src1
MAX2
↓ ↓
31 16 15 0
(a_hi>b_hi) ? a_hi:b_hi (a_lo>b_lo) ? a_lo:b_lo ←dst
Execution if (cond) {
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
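The per-halfword comparison can be modeled in Python as follows. This is an illustrative sketch with hypothetical helper names (`_s16`, `max2`); the input values in the usage line are made up for the illustration and are not taken from the examples in this section.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def max2(src1, src2):
    """Model MAX2: signed maximum of each 16-bit lane."""
    hi = max(_s16(src1 >> 16), _s16(src2 >> 16)) & 0xFFFF
    lo = max(_s16(src1), _s16(src2)) & 0xFFFF
    return (hi << 16) | lo

print(f"{max2(0x0001FFFF, 0xFFFF0002):08X}")  # 00010002
```

Because the lanes are signed, FFFFh (-1) loses to both 0001h and 0002h in this illustration.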
Examples Example 1
Example 2
MAX2 .L2X A2, B8, B12
Example 3
Example 4
4.177 MAXU4
Maximum, Unsigned, Packed 8-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 0 0 1 1 1 1 0 s p
5 1 1 1
Description Performs a maximum operation on unsigned, packed 8-bit values. For each pair of
unsigned 8-bit values in src1 and src2, MAXU4 places the larger value in the
corresponding position in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
MAXU4
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
ua_3 > ub_3 ? ua_2 > ub_2 ? ua_1 > ub_1 ? ua_0 > ub_0 ? ←dst
ua_3:ub_3 ua_2:ub_2 ua_1:ub_1 ua_0:ub_0
Execution if (cond) {
if (ubyte0(src1) >= ubyte0(src2)), ubyte0(src1) → ubyte0(dst)
else ubyte0(src2) → ubyte0(dst);
if (ubyte1(src1) >= ubyte1(src2)), ubyte1(src1) → ubyte1(dst)
else ubyte1(src2) → ubyte1(dst);
if (ubyte2(src1) >= ubyte2(src2)), ubyte2(src1) → ubyte2(dst)
else ubyte2(src2) → ubyte2(dst);
if (ubyte3(src1) >= ubyte3(src2)), ubyte3(src1) → ubyte3(dst)
else ubyte3(src2) → ubyte3(dst)
}
else nop
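The four byte comparisons in the Execution block above can be written as a loop over lanes. This is a minimal C sketch for host-side checking; maxu4_model and maxu4_check are hypothetical names, not TI intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of MAXU4: unsigned maximum in each 8-bit lane. */
uint32_t maxu4_model(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t a = (src1 >> (8 * lane)) & 0xFFu;  /* ubyteN(src1) */
        uint32_t b = (src2 >> (8 * lane)) & 0xFFu;  /* ubyteN(src2) */
        dst |= ((a >= b) ? a : b) << (8 * lane);
    }
    return dst;
}

/* Lanes 3..0: max(01,02)=02, max(FF,00)=FF, max(80,7F)=80, max(00,10)=10. */
void maxu4_check(void)
{
    assert(maxu4_model(0x01FF8000u, 0x02007F10u) == 0x02FF8010u);
}
```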
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
4.178 MFENCE
Memory Fence
Syntax MFENCE
31 30 29 28 27 18 17 16 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 parameter 0 opcode 0 0 0 0 0 0 0 0 0 0 0 s p
10 4
Description The MFENCE instruction stalls the instruction fetch pipeline until the memory system
busy flag goes low.
The instruction always waits at least 5 clock cycles before checking the busy flag,
to account for pipeline delays at the point where the mem_to_cpu_busy signal is
sampled.
For example, the following code waits until the STW write has completed:
STW A0, *A1
MFENCE ; Waits until the STW write above has landed in its final destination
During the course of executing an MFENCE operation, any enabled interrupts are still
serviced. IRP or NRP is set to the PC of the execute packet containing the
MFENCE instruction. (Note: This differs from how IDLE and multicycle NOP
instructions are handled.)
{
nop ;
}
Delay Slots 0
See Also
Example
4.179 MIN2
Minimum, Signed, Packed 16-Bit
Opcode .L unit
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 0 0 0 1 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 1 0 0 1 1 0 0 s p
5 1 1 1
Description Performs a minimum operation on signed, packed 16-bit values. For each pair of
signed 16-bit values in src1 and src2, MIN2 places the smaller value in the
corresponding position in dst.
31 16 15 0
a_hi a_lo ←src1
MIN2
↓ ↓
31 16 15 0
(a_hi<b_hi) ? a_hi:b_hi (a_lo<b_lo) ? a_lo:b_lo ←dst
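As a host-side illustration of the diagram above, the following minimal C sketch selects the smaller signed value in each 16-bit lane. The names sext16, min2_model, and min2_check are hypothetical and are not TI intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a signed 32-bit value. */
static int32_t sext16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Hypothetical model of MIN2: signed minimum in each 16-bit lane. */
uint32_t min2_model(uint32_t src1, uint32_t src2)
{
    int32_t a_hi = sext16(src1 >> 16), a_lo = sext16(src1);
    int32_t b_hi = sext16(src2 >> 16), b_lo = sext16(src2);
    int32_t hi = (a_hi < b_hi) ? a_hi : b_hi;
    int32_t lo = (a_lo < b_lo) ? a_lo : b_lo;
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}

/* Upper lane: 0001h < 7FFFh. Lower lane: 8000h (-32768) < 0001h. */
void min2_check(void)
{
    assert(min2_model(0x7FFF8000u, 0x00010001u) == 0x00018000u);
}
```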
Execution if (cond) {
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
MIN2 .L2X A2, B8, B12
Example 3
Example 4
4.180 MINU4
Minimum, Unsigned, Packed 8-Bit
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 1 0 0 0 1 1 0 s p
5 1 1 1
Description Performs a minimum operation on unsigned, packed 8-bit values. For each pair of
unsigned 8-bit values in src1 and src2, MINU4 places the smaller value in the
corresponding position in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
MINU4
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
ua_3 < ub_3 ? ua_2 < ub_2 ? ua_1 < ub_1 ? ua_0 < ub_0 ? ←dst
ua_3:ub_3 ua_2:ub_2 ua_1:ub_1 ua_0:ub_0
Execution if (cond) {
if (ubyte0(src1) <= ubyte0(src2)), ubyte0(src1) → ubyte0(dst)
else ubyte0(src2) → ubyte0(dst);
if (ubyte1(src1) <= ubyte1(src2)), ubyte1(src1) → ubyte1(dst)
else ubyte1(src2) → ubyte1(dst);
if (ubyte2(src1) <= ubyte2(src2)), ubyte2(src1) → ubyte2(dst)
else ubyte2(src2) → ubyte2(dst);
if (ubyte3(src1) <= ubyte3(src2)), ubyte3(src1) → ubyte3(dst)
else ubyte3(src2) → ubyte3(dst)
}
else nop
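The byte-wise Execution block above can be modeled with a loop over the four lanes. This is a minimal C sketch under the assumption that only the result value matters; minu4_model and minu4_check are hypothetical names, not TI intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of MINU4: unsigned minimum in each 8-bit lane. */
uint32_t minu4_model(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t a = (src1 >> (8 * lane)) & 0xFFu;  /* ubyteN(src1) */
        uint32_t b = (src2 >> (8 * lane)) & 0xFFu;  /* ubyteN(src2) */
        dst |= ((a <= b) ? a : b) << (8 * lane);
    }
    return dst;
}

/* Lanes 3..0: min(01,02)=01, min(FF,00)=00, min(80,7F)=7F, min(00,10)=00. */
void minu4_check(void)
{
    assert(minu4_model(0x01FF8000u, 0x02007F10u) == 0x01007F00u);
}
```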
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
4.181 MPY
Multiply Signed 16 LSB × Signed 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 7 6 5 4 3 2 1 0
src1 x op 0 0 0 0 0 s p
5 1 5 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
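A minimal C sketch of the arithmetic, assuming (per the entry title) that only the 16 least significant bits of each source take part and that both are signed. The names sext16, mpy_model, and mpy_check are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a signed 32-bit value. */
static int32_t sext16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Hypothetical model of MPY: signed 16 LSB x signed 16 LSB -> 32-bit product. */
int32_t mpy_model(uint32_t src1, uint32_t src2)
{
    /* The largest magnitude, (-32768) * (-32768) = 2^30, fits in 32 bits. */
    return sext16(src1) * sext16(src2);
}

/* The upper halfwords (1234h, ABCDh) are ignored: FFFFh * 0002h = -1 * 2. */
void mpy_check(void)
{
    assert(mpy_model(0x1234FFFFu, 0xABCD0002u) == -2);
    assert(mpy_model(0x00008000u, 0x00008000u) == 1073741824);
}
```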
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
MPY .M1 A1,A2,A3
Example 2
4.182 MPY2
Multiply Signed by Signed, 16 LSB × 16 LSB and 16 MSB × 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 0 0 1 1 0 0 s p
5 1 1 1
Description Performs two 16-bit by 16-bit multiplications between two pairs of signed, packed
16-bit values. The values in src1 and src2 are treated as signed, packed 16-bit quantities.
The two 32-bit results are written into a 64-bit register pair.
The product of the lower halfwords of src1 and src2 is written to the even destination
register, dst_e. The product of the upper halfwords of src1 and src2 is written to the odd
destination register, dst_o.
This instruction helps reduce the number of instructions required to perform two
16-bit by 16-bit multiplies on both the lower and upper halves of two registers.
31 16 15 0
a_hi a_lo ←src1
× ×
MPY2
63 32 31 0
a_hi × b_hi a_lo × b_lo ←dst_o:dst_e
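The two products in the diagram above can be sketched in C as follows. The struct and function names are hypothetical, and the input values in mpy2_check are chosen for illustration only; they happen to give the products -546,656,088 and 125,126,188.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a signed 32-bit value. */
static int32_t sext16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Hypothetical container for the dst_o:dst_e register pair. */
typedef struct {
    int32_t dst_o;  /* product of the upper halfwords */
    int32_t dst_e;  /* product of the lower halfwords */
} mpy2_pair;

/* Hypothetical model of MPY2: two signed 16 x 16 multiplies. */
mpy2_pair mpy2_model(uint32_t src1, uint32_t src2)
{
    mpy2_pair r;
    r.dst_e = sext16(src1) * sext16(src2);
    r.dst_o = sext16(src1 >> 16) * sext16(src2 >> 16);
    return r;
}

/* 6A32h * B174h = 27186 * -20108; 1193h * 6CA4h = 4499 * 27812. */
void mpy2_check(void)
{
    mpy2_pair r = mpy2_model(0x6A321193u, 0xB1746CA4u);
    assert(r.dst_o == -546656088);
    assert(r.dst_e == 125126188);
}
```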
Execution if (cond) {
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Examples Example 1
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 DF6A B0A8h 0775 462Ch
-546,656,088 125,126,188
Example 2
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 026A D5CCh 1091 7E81h
40,555,980 277,970,561
4.183 MPY2IR
Multiply Two 16-Bit × 32-Bit, Shifted by 15 to Produce a Rounded 32-Bit Result
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 1 1 1 1 0 0 s p
5 1 1 1
Description Performs two 16-bit by 32-bit multiplies. The upper and lower halves of src1 are treated
as 16-bit signed inputs. The value in src2 is treated as a 32-bit signed value. The
products are then rounded to a 32-bit result by adding the value 2^14 and then these
sums are right shifted by 15. The lower 32 bits of the two results are written into
dst_o:dst_e.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written
one cycle after the results are written to dst_o:dst_e.
Note—In the overflow case, where the 16-bit input to the MPYIR operation is
8000h and the 32-bit input is 8000 0000h, the saturation value 7FFF FFFFh is
written into the corresponding 32-bit dst register.
Execution if (msb16(src1) = 8000h && src2 = 80000000h), 7FFFFFFFh → dst_o
Delay Slots 3
Examples Example 1
B2 80008001h B8 7FFF0000h
B5 80000000h B9 7FFFFFFFh
Example 2
MPY2IR .M1X A2,B5,A9:A8
A2 87654321h A8 098C16C1h
B5 12345678h A9 EED8E38Fh
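The register values in the two examples above can be reproduced on a host with a short C sketch. It assumes the rounding described in the Description (add 2^14, then shift right by 15, modeled here as floor division by 2^15) and the saturation case given in the Note; status bits in SSR and CSR are not modeled. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a signed 32-bit value. */
static int32_t sext16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* One rounded 16 x 32 multiply: ((a * b) + 2^14) >> 15, saturating
 * the single overflow case 8000h x 8000 0000h to 7FFF FFFFh. */
static int32_t mpyir_half(int32_t a16, int32_t b32)
{
    if (a16 == -32768 && b32 == INT32_MIN)
        return INT32_MAX;
    int64_t sum = (int64_t)a16 * b32 + (1 << 14);
    int64_t q = sum / 32768;      /* C division truncates toward zero */
    if (sum % 32768 < 0)
        q -= 1;                   /* adjust to floor, like an arithmetic shift */
    return (int32_t)q;
}

/* Hypothetical container for the dst_o:dst_e register pair. */
typedef struct { int32_t dst_o; int32_t dst_e; } mpy2ir_pair;

/* Hypothetical model of MPY2IR. */
mpy2ir_pair mpy2ir_model(uint32_t src1, int32_t src2)
{
    mpy2ir_pair r;
    r.dst_e = mpyir_half(sext16(src1), src2);        /* lower half of src1 */
    r.dst_o = mpyir_half(sext16(src1 >> 16), src2);  /* upper half of src1 */
    return r;
}
```

With src1 = 87654321h and src2 = 12345678h, this model gives dst_e = 098C16C1h and dst_o = EED8E38Fh, matching Example 2.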
4.184 MPY32
Multiply Signed 32-Bit × Signed 32-Bit Into 32-Bit Result
Compatibility
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 0 0 0 0 0 0 0 s p
5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are signed 32-bit values. Only the
lower 32 bits of the 64-bit result are written to dst.
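To illustrate what keeping only the lower 32 bits means, the following minimal C sketch computes the full signed 64-bit product and then truncates it. The function names are hypothetical, and the full-product helper is included only for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Full signed 64-bit product, for comparison. */
int64_t mpy32_full_model(int32_t src1, int32_t src2)
{
    return (int64_t)src1 * (int64_t)src2;
}

/* Hypothetical model of the 32-bit-result form: the low 32 bits of
 * the 64-bit product, returned as a raw bit pattern. */
uint32_t mpy32_low_model(int32_t src1, int32_t src2)
{
    return (uint32_t)mpy32_full_model(src1, src2);  /* reduces modulo 2^32 */
}

/* 65536 * 65536 = 2^32: the full product is nonzero, the low word is zero. */
void mpy32_check(void)
{
    assert(mpy32_full_model(65536, 65536) == 4294967296LL);
    assert(mpy32_low_model(65536, 65536) == 0u);
    assert(mpy32_low_model(-2, 3) == 0xFFFFFFFAu);  /* -6 as a 32-bit pattern */
}
```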
else nop
Delay Slots 3
4.185 MPY32
Multiply Signed 32-Bit × Signed 32-Bit Into Signed 64-Bit Result
Compatibility
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 0 0 0 0 0 0 0 s p
5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are signed 32-bit values. The signed
64-bit result is written to the register pair specified by dst.
else nop
Delay Slots 3
4.186 MPY32SU
Multiply Signed 32-Bit × Unsigned 32-Bit Into Signed 64-Bit Result
Compatibility
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 1 0 0 0 0 0 0 s p
5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 is a signed 32-bit value and src2 is an
unsigned 32-bit value. The signed 64-bit result is written to the register pair specified
by dst.
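The effect of mixed signedness is easiest to see when the same bit pattern appears in both operands. This is a minimal C sketch with hypothetical names; it models only the arithmetic result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of MPY32SU: signed 32-bit x unsigned 32-bit -> signed 64-bit.
 * The magnitude is below 2^31 * 2^32 = 2^63, so the product fits in int64_t. */
int64_t mpy32su_model(int32_t src1, uint32_t src2)
{
    return (int64_t)src1 * (int64_t)src2;
}

/* FFFF FFFFh is -1 as src1 (signed) but 4294967295 as src2 (unsigned). */
void mpy32su_check(void)
{
    assert(mpy32su_model(-1, 0xFFFFFFFFu) == -4294967295LL);
    assert(mpy32su_model(2, 0x80000000u) == 4294967296LL);
}
```

If both operands were treated as signed, the first product would be (-1) × (-1) = 1 instead.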
else nop
Delay Slots 3
4.187 MPY32U
Multiply Unsigned 32-Bit × Unsigned 32-Bit Into Unsigned 64-Bit Result
Compatibility
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 0 0 1 1 0 0 s p
5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are unsigned 32-bit values. The
unsigned 64-bit result is written to the register pair specified by dst.
else nop
Delay Slots 3
4.188 MPY32US
Multiply Unsigned 32-Bit × Signed 32-Bit Into Signed 64-Bit Result
Compatibility
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 0 1 1 1 0 0 s p
5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 is an unsigned 32-bit value and src2 is a
signed 32-bit value. The signed 64-bit result is written to the register pair specified by
dst.
else nop
Delay Slots 3
4.189 MPYDP
Multiply Two Double-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 1 0 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst.
Note—
1) If one source is SNaN or QNaN, the result is a signed NaN_out. If either source
is SNaN, the INVAL bit is set also. The sign of NaN_out is the exclusive-OR of
the input signs.
2) Signed infinity multiplied by signed infinity or a normalized number (other
than signed 0) returns signed infinity. Signed infinity multiplied by signed 0
returns a signed NaN_out and sets the INVAL bit.
3) If one or both sources are signed 0, the result is signed 0 unless the other source
is NaN or signed infinity, in which case the result is signed NaN_out.
4) A denormalized source is treated as signed 0 and the DENn bit is set. The INEX
bit is set except when the other source is signed infinity, signed NaN, or signed
0. Therefore, a signed infinity multiplied by a denormalized number gives a
signed NaN_out and sets the INVAL bit.
5) If rounding is performed, the INEX bit is set.
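Rules 2 and 3 above agree with ordinary IEEE 754 multiplication, so they can be spot-checked on a host. The following minimal C sketch assumes the host uses IEEE 754 doubles. It does not model the INVAL, INEX, or DENn bits, the sign of NaN_out, or rule 4: a typical host multiplies denormalized inputs as-is rather than treating them as signed 0. All names are hypothetical.

```c
#include <assert.h>
#include <math.h>

typedef enum { PROD_NAN, PROD_INF, PROD_ZERO, PROD_FINITE } prod_class;

/* Classify the host product a * b. */
prod_class classify_product(double a, double b)
{
    double p = a * b;
    if (isnan(p)) return PROD_NAN;
    if (isinf(p)) return PROD_INF;
    if (p == 0.0) return PROD_ZERO;
    return PROD_FINITE;
}

/* Nonzero when the product carries a negative sign (including -0 and -inf). */
int product_is_negative(double a, double b)
{
    return signbit(a * b) != 0;
}

void special_case_check(void)
{
    /* Rule 2: infinity x normalized number -> signed infinity. */
    assert(classify_product(-INFINITY, 2.0) == PROD_INF);
    assert(product_is_negative(-INFINITY, 2.0));
    /* Rule 2: infinity x signed 0 -> NaN. */
    assert(classify_product(INFINITY, -0.0) == PROD_NAN);
    /* Rule 3: signed 0 x normalized number -> signed 0. */
    assert(classify_product(-0.0, 3.0) == PROD_ZERO);
    assert(product_is_negative(-0.0, 3.0));
}
```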
Execution if (cond) src1 × src2 → dst
else nop
Pipeline
Pipeline Stage  E1       E2       E3       E4       E5  E6  E7  E8  E9     E10
Read            src1_l,  src1_l,  src1_h,  src1_h,
                src2_l   src2_h   src2_l   src2_h
Written                                                             dst_l  dst_h
Unit in use     .M       .M       .M       .M
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYSP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 9
Before instruction                      After instruction
A1:A0 4021 3333h 3333 3333h  8.6        A1:A0 4021 3333h 3333 3333h  8.6
A3:A2 C004 0000h 0000 0000h  -2.5       A3:A2 C004 0000h 0000 0000h  -2.5
A5:A4 xxxx xxxxh xxxx xxxxh             A5:A4 C035 8000h 0000 0000h  -21.5
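The register contents in this example are IEEE 754 double-precision bit patterns, which can be confirmed on a host. This minimal C sketch assumes the host double is IEEE 754 binary64 with round-to-nearest; double_bits and mpydp_example_check are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 64-bit pattern of a double (upper word : lower word). */
uint64_t double_bits(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);  /* well-defined way to reinterpret the bits */
    return u;
}

void mpydp_example_check(void)
{
    assert(double_bits(8.6)        == 0x4021333333333333ULL);  /* A1:A0 */
    assert(double_bits(-2.5)       == 0xC004000000000000ULL);  /* A3:A2 */
    assert(double_bits(8.6 * -2.5) == 0xC035800000000000ULL);  /* A5:A4, -21.5 */
}
```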
4.190 MPYH
Multiply Signed 16 MSB × Signed 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 1 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.191 MPYHI
Multiply 16 MSB × 32-Bit Into 64-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 0 1 1 0 0 s p
5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The upper half of src1 is used as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The result is written into the
lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign extended to 64 bits.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Examples Example 1
Example 2
4.192 MPYHIR
Multiply 16 MSB × 32-Bit, Shifted by 15 to Produce a Rounded 32-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 0 0 1 1 0 0 s p
5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The upper half of src1 is treated as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The product is then rounded
to a 32-bit result by adding the value 2^14 and then this sum is right shifted by 15. The
lower 32 bits of the result are written into dst.
31 16 15 0
a_hi a_lo ←src1
×
MPYHIR
31 0
((a_hi × b_hi:b_lo) + 4000h) >> 15 ←dst
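The expression in the diagram above rounds to nearest, with halves rounded upward, rather than truncating. The following minimal C sketch shows this on a host. The names are hypothetical, and the input pair 8000h with 8000 0000h is excluded by an assertion because its result, 2^31, does not fit in a signed 32-bit value and its handling is not described in the text of this entry.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a signed 32-bit value. */
static int32_t sext16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Hypothetical model of MPYHIR: ((a_hi * src2) + 4000h) >> 15. */
int32_t mpyhir_model(uint32_t src1, int32_t src2)
{
    int32_t a_hi = sext16(src1 >> 16);
    assert(!(a_hi == -32768 && src2 == INT32_MIN));  /* corner case not modeled */
    int64_t sum = (int64_t)a_hi * src2 + 0x4000;
    int64_t q = sum / 32768;      /* C division truncates toward zero */
    if (sum % 32768 < 0)
        q -= 1;                   /* adjust to floor, like an arithmetic shift */
    return (int32_t)q;
}
```

For example, with a_hi = 1 the inputs src2 = 3FFFh and src2 = 4000h give 0 and 1 respectively: the product 4000h is exactly half of 2^15 and is rounded up.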
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.193 MPYHL
Multiply Signed 16 MSB × Signed 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 1 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.194 MPYHLU
Multiply Unsigned 16 MSB × Unsigned 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 1 1 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.195 MPYHSLU
Multiply Signed 16 MSB × Unsigned 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 1 0 0 0 0 0 s p
5 1 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
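The 16 × 16 multiply mnemonics differ only in which halfword of each source is used (H or L) and whether it is interpreted as signed or unsigned. The following minimal C sketch parameterizes both choices. The names half_sel, pick_half, and mpy16_model are hypothetical, and the result is returned as a 64-bit integer only to keep the signed and unsigned cases in one type.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { HALF_LSB = 0, HALF_MSB = 1 } half_sel;

/* Extract one halfword and interpret it as signed or unsigned. */
static int64_t pick_half(uint32_t reg, half_sel half, int is_signed)
{
    uint32_t h = (half == HALF_MSB) ? (reg >> 16) : (reg & 0xFFFFu);
    if (is_signed && (h & 0x8000u))
        return (int64_t)h - 0x10000;
    return (int64_t)h;
}

/* Hypothetical generic model of the 16 x 16 multiply family. */
int64_t mpy16_model(uint32_t src1, half_sel h1, int s1,
                    uint32_t src2, half_sel h2, int s2)
{
    return pick_half(src1, h1, s1) * pick_half(src2, h2, s2);
}

void mpy16_check(void)
{
    uint32_t a = 0xFFFF0000u, b = 0x0000FFFFu;
    /* MPYHSLU: signed MSB x unsigned LSB = -1 * 65535 */
    assert(mpy16_model(a, HALF_MSB, 1, b, HALF_LSB, 0) == -65535);
    /* MPYHLU: unsigned MSB x unsigned LSB = 65535 * 65535 */
    assert(mpy16_model(a, HALF_MSB, 0, b, HALF_LSB, 0) == 4294836225LL);
    /* MPYHL: signed MSB x signed LSB = -1 * -1 */
    assert(mpy16_model(a, HALF_MSB, 1, b, HALF_LSB, 1) == 1);
}
```

The same bit patterns give three different products, which is why the S in the mnemonic is needed whenever the two operands differ in signedness.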
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.196 MPYHSU
Multiply Signed 16 MSB × Unsigned 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 0 0 0 0 0 s p
5 1 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.197 MPYHU
Multiply Unsigned 16 MSB × Unsigned 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 1 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.198 MPYHULS
Multiply Unsigned 16 MSB × Signed 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 1 0 0 0 0 0 s p
5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.199 MPYHUS
Multiply Unsigned 16 MSB × Signed 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 0 1 0 0 0 0 0 s p
5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.200 MPYI
Multiply 32-Bit × 32-Bit Into 32-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 7 6 5 4 3 2 1 0
src1 x op 0 0 0 0 0 s p
5 1 5 1 1
Description The src1 operand is multiplied by the src2 operand. The lower 32 bits of the result are
placed in dst.
Pipeline
Pipeline Stage  E1     E2     E3     E4     E5  E6  E7  E8  E9
Read            src1,  src1,  src1,  src1,
                src2   src2   src2   src2
Written                                                     dst
Unit in use     .M     .M     .M     .M
Delay Slots 8
4.201 MPYID
Multiply 32-Bit × 32-Bit Into 64-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 7 6 5 4 3 2 1 0
src1 x op 0 0 0 0 0 s p
5 1 5 1 1
Description The src1 operand is multiplied by the src2 operand. The 64-bit result is placed in the dst
register pair.
else nop
Pipeline
Pipeline Stage  E1     E2     E3     E4     E5  E6  E7  E8  E9     E10
Read            src1,  src1,  src1,  src1,
                src2   src2   src2   src2
Written                                                     dst_l  dst_h
Unit in use     .M     .M     .M     .M
4.202 MPYIH
Multiply 32-Bit × 16-MSB Into 64-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 0 1 1 0 0 s p
5 1 1 1
Description The MPYIH pseudo-operation performs a 16-bit by 32-bit multiply. The upper half of
src1 is used as a signed 16-bit input. The value in src2 is treated as a signed 32-bit value.
The result is written into the lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign
extended to 64 bits. The assembler uses the MPYHI (.unit) src1, src2, dst instruction to
perform this operation.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.203 MPYIHR
Multiply 32-Bit × 16 MSB, Shifted by 15 to Produce a Rounded 32-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 0 0 1 1 0 0 s p
5 1 1 1
Description The MPYIHR pseudo-operation performs a 16-bit by 32-bit multiply. The upper half
of src1 is treated as a signed 16-bit input. The value in src2 is treated as a signed 32-bit
value. The product is then rounded to a 32-bit result by adding the value 2^14 and then
this sum is right shifted by 15. The lower 32 bits of the result are written into dst. The
assembler uses the MPYHIR (.unit) src1, src2, dst instruction to perform this
operation.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.204 MPYIL
Multiply 32-Bit × 16 LSB Into 64-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 1 1 1 0 0 s p
5 1 1 1
Description The MPYIL pseudo-operation performs a 16-bit by 32-bit multiply. The lower half of
src1 is used as a signed 16-bit input. The value in src2 is treated as a signed 32-bit value.
The result is written into the lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign
extended to 64 bits. The assembler uses the MPYLI (.unit) src1, src2, dst instruction to
perform this operation.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.205 MPYILR
Multiply 32-Bit × 16 LSB, Shifted by 15 to Produce a Rounded 32-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 1 0 1 1 0 0 s p
5 1 1 1
Description The MPYILR pseudo-operation performs a 16-bit by 32-bit multiply. The lower half of
src1 is used as a signed 16-bit input. The value in src2 is treated as a signed 32-bit value.
The product is then rounded to a 32-bit result by adding the value 2^14 and then this sum
is right shifted by 15. The lower 32 bits of the result are written into dst. The assembler
uses the MPYLIR (.unit) src1, src2, dst instruction to perform this operation.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.206 MPYLH
Multiply Signed 16 LSB × Signed 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 0 1 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.207 MPYLHU
Multiply Unsigned 16 LSB × Unsigned 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 1 1 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.208 MPYLI
Multiply 16 LSB × 32-Bit Into 64-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 1 1 1 0 0 s p
5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The lower half of src1 is used as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The result is written into the
lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign extended to 64 bits.
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Examples Example 1
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 FFFF FA9Bh A111 462Ch
-5,928,647,571,924
Example 2
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 0000 06FBh E9FA 7E81h
7,679,032,065,665
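The 48-bit product and its sign extension to 64 bits can be modeled with a single 64-bit multiply. This is a minimal C sketch with hypothetical names. The input values in mpyli_check are chosen for illustration only; they happen to give the product -5,928,647,571,924 (FFFF FA9Bh A111 462Ch).

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a signed 32-bit value. */
static int32_t sext16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Hypothetical model of MPYLI: signed 16 LSB of src1 x signed 32-bit src2.
 * The product needs at most 48 bits; int64_t holds it sign-extended. */
int64_t mpyli_model(uint32_t src1, int32_t src2)
{
    return (int64_t)sext16(src1) * (int64_t)src2;
}

/* 1193h = 4499 and B174 6CA4h = -1,317,770,076 as a signed 32-bit value. */
void mpyli_check(void)
{
    int64_t p = mpyli_model(0x6A321193u, -1317770076);
    assert(p == -5928647571924LL);
    assert((uint64_t)p == 0xFFFFFA9BA111462CULL);  /* dst_o:dst_e pattern */
}
```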
4.209 MPYLIR
Multiply 16 LSB × 32-Bit, Shifted by 15 to Produce a Rounded 32-Bit Result
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 1 0 1 1 0 0 s p
5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The lower half of src1 is treated as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The product is then rounded
into a 32-bit result by adding the value 2^14 and then this sum is right shifted by 15. The
lower 32 bits of the result are written into dst.
31 16 15 0
a_hi a_lo ←src1
×
MPYLIR
31 0
((a_lo × b_hi:b_lo) + 4000h) >> 15 ←dst
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.210 MPYLSHU
Multiply Signed 16 LSB × Unsigned 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 1 1 0 0 0 0 0 s p
5 1 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.211 MPYLUHS
Multiply Unsigned 16 LSB × Signed 16 MSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 0 1 0 0 0 0 0 s p
5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.212 MPYSP
Multiply Two Single-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 0 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst.
Note—
1) If one source is SNaN or QNaN, the result is a signed NaN_out. If either source
is SNaN, the INVAL bit is set also. The sign of NaN_out is the exclusive-OR of
the input signs.
2) Signed infinity multiplied by signed infinity or a normalized number (other
than signed 0) returns signed infinity. Signed infinity multiplied by signed 0
returns a signed NaN_out and sets the INVAL bit.
3) If one or both sources are signed 0, the result is signed 0 unless the other source
is NaN or signed infinity, in which case the result is signed NaN_out.
4) A denormalized source is treated as signed 0 and the DENn bit is set. The INEX
bit is set except when the other source is signed infinity, signed NaN, or signed
0. Therefore, a signed infinity multiplied by a denormalized number gives a
signed NaN_out and sets the INVAL bit.
5) If rounding is performed, the INEX bit is set.
Execution if (cond) src1 × src2 → dst
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 3
4.213 MPYSPDP
Multiply Single-Precision Floating-Point Value × Double-Precision Floating-Point Value
Opcode
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 1 0 1 1 0 0 s p
5 1 1 1
Description The single-precision src1 operand is multiplied by the double-precision src2 operand
to produce a double-precision result. The result is placed in dst.
Note—
1) If one source is SNaN or QNaN, the result is a signed NaN_out. If either source
is SNaN, the INVAL bit is set also. The sign of NaN_out is the exclusive-OR of
the input signs.
2) Signed infinity multiplied by signed infinity or a normalized number (other
than signed 0) returns signed infinity. Signed infinity multiplied by signed 0
returns a signed NaN_out and sets the INVAL bit.
3) If one or both sources are signed 0, the result is signed 0 unless the other source
is NaN or signed infinity, in which case the result is signed NaN_out.
4) A denormalized source is treated as signed 0 and the DENn bit is set. The INEX
bit is set except when the other source is signed infinity, signed NaN, or signed
0. Therefore, a signed infinity multiplied by a denormalized number gives a
signed NaN_out and sets the INVAL bit.
5) If rounding is performed, the INEX bit is set.
Execution if (cond) src1 × src2 → dst
else nop
Pipeline
Pipeline Stage   E1             E2             E3   E4   E5   E6      E7
Read             src1, src2_l   src1, src2_h
Written                                                       dst_l   dst_h
Unit in use      .M             .M
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP,
MPYSPDP, MPYSP2DP, or SUBDP instruction, the number of delay slots can be
reduced by one, because these instructions read the lower word of the DP source one
cycle before the upper word of the DP source.
Delay Slots 6
4.214 MPYSP2DP
Multiply Two Single-Precision Floating-Point Values for Double-Precision Result
Opcode
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 1 1 1 1 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand to produce a double-precision
result. The result is placed in dst.
Note—
1) If one source is SNaN or QNaN, the result is a signed NaN_out. If either source
is SNaN, the INVAL bit is set also. The sign of NaN_out is the exclusive-OR of
the input signs.
2) Signed infinity multiplied by signed infinity or a normalized number (other
than signed 0) returns signed infinity. Signed infinity multiplied by signed 0
returns a signed NaN_out and sets the INVAL bit.
3) If one or both sources are signed 0, the result is signed 0 unless the other source
is NaN or signed infinity, in which case the result is signed NaN_out.
4) A denormalized source is treated as signed 0 and the DENn bit is set. The INEX
bit is set except when the other source is signed infinity, signed NaN, or signed
0. Therefore, a signed infinity multiplied by a denormalized number gives a
signed NaN_out and sets the INVAL bit.
5) If rounding is performed, the INEX bit is set.
Execution if (cond) src1 × src2 → dst
else nop
Pipeline
Pipeline Stage   E1           E2   E3   E4      E5
Read             src1, src2
Written                                 dst_l   dst_h
Unit in use      .M
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP,
MPYSPDP, MPYSP2DP, or SUBDP instruction, the number of delay slots can be
reduced by one, because these instructions read the lower word of the DP source one
cycle before the upper word of the DP source.
Delay Slots 4
4.215 MPYSU
Multiply Signed 16 LSB × Unsigned 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 7 6 5 4 3 2 1 0
src1 x op 0 0 0 0 0 s p
5 1 5 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution if (cond) lsb16(src1) × lsb16(src2) → dst
else nop
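The Execution line above can be modeled in a few lines of Python. This is a minimal sketch, assuming dst receives the 32-bit two's-complement image of the product; the function name mpysu is illustrative and not part of any TI tool.

```python
def mpysu(src1: int, src2: int) -> int:
    """Model MPYSU: signed lsb16(src1) times unsigned lsb16(src2)."""
    a = src1 & 0xFFFF
    if a & 0x8000:                 # reinterpret the low halfword of src1 as signed
        a -= 0x10000
    b = src2 & 0xFFFF              # low halfword of src2 is taken as unsigned
    return (a * b) & 0xFFFFFFFF    # 32-bit register image of the product


# -1 (FFFFh, signed) times 2 (unsigned) gives -2, stored as FFFFFFFEh.
result = mpysu(0x0000FFFF, 0x00000002)
```

The upper halfwords of both sources are ignored, which is why only the 16 LSBs are masked out before the multiply.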
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.216 MPYSU4
Multiply Signed × Unsigned, Four 8-Bit Pairs for Four 16-Bit Results
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 0 1 1 1 0 0 s p
5 1 1 1
Description Returns the product between four sets of packed 8-bit values producing four signed
16-bit results. The four signed 16-bit results are packed into a 64-bit register pair,
dst_o:dst_e. The values in src1 are treated as signed 8-bit packed quantities; whereas,
the values in src2 are treated as unsigned 8-bit packed data.
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2:
• The product of src1 byte 0 and src2 byte 0 is written to the lower half of dst_e.
• The product of src1 byte 1 and src2 byte 1 is written to the upper half of dst_e.
• The product of src1 byte 2 and src2 byte 2 is written to the lower half of dst_o.
• The product of src1 byte 3 and src2 byte 3 is written to the upper half of dst_o.
31       24 23       16 15        8 7         0
    sa_3        sa_2        sa_1        sa_0      ← src1
     ×           ×           ×           ×
    ub_3        ub_2        ub_1        ub_0      ← src2
                      MPYSU4
63           48 47           32 31           16 15            0
  sa_3 × ub_3     sa_2 × ub_2     sa_1 × ub_1     sa_0 × ub_0   ← dst_o:dst_e
Execution if (cond) {
(sbyte0(src1) × ubyte0(src2)) → lsb16(dst_e);
(sbyte1(src1) × ubyte1(src2)) → msb16(dst_e);
(sbyte2(src1) × ubyte2(src2)) → lsb16(dst_o);
(sbyte3(src1) × ubyte3(src2)) → msb16(dst_o)
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
MPYSU4 .M1 A5,A6,A9:A8
         Before instruction        After instruction
A9:A8    xxxx xxxxh xxxx xxxxh     494A 16A8h 072C BA2Ch
                                   18762 5800 1836 -17876 (signed)
Example 2
         Before instruction        After instruction
B9:B8    xxxx xxxxh xxxx xxxxh     2FFD FCA4h 00A0 0440h
                                   12285 -860 160 1088 (signed)
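The byte-wise behavior described for MPYSU4 can be sketched in Python as follows. The function name and the two source values are assumptions: the inputs were chosen for illustration so that the result reproduces the A9:A8 value 494A 16A8h 072C BA2Ch shown in Example 1.

```python
def mpysu4(src1: int, src2: int) -> int:
    """Return the 64-bit dst_o:dst_e image of four signed x unsigned byte products."""
    result = 0
    for i in range(4):
        a = (src1 >> (8 * i)) & 0xFF
        if a & 0x80:                       # bytes of src1 are signed
            a -= 0x100
        b = (src2 >> (8 * i)) & 0xFF       # bytes of src2 are unsigned
        # product i lands in 16-bit lane i of dst_o:dst_e
        result |= ((a * b) & 0xFFFF) << (16 * i)
    return result


# Hypothetical source values (not taken from the listing above).
pair = mpysu4(0x6A321193, 0xB1746CA4)
dst_o, dst_e = pair >> 32, pair & 0xFFFFFFFF
```

Lane 0 here is -109 × 164 = -17876, which is stored as BA2Ch, and lane 1 is 17 × 108 = 1836, stored as 072Ch.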
4.217 MPYU
Multiply Unsigned 16 LSB × Unsigned 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 1 1 0 0 0 0 0 s p
5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
Execution if (cond) lsb16(src1) × lsb16(src2) → dst
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.218 MPYU2
Multiply Unsigned by Unsigned, Packed 16-bit
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
3 5 5 5 5
Description The MPYU2 instruction performs 16-bit multiplication between unsigned packed 16-bit
quantities. The values in op1 and xop2 are treated as unsigned packed 16-bit quantities.
The 32-bit results are placed in a 64-bit register pair.
The product of the lower 16-bit quantities in op1 and xop2 is written to the even register
of the destination, dst_e. The product of the upper 16-bit quantities in op1 and xop2 is
written to the odd register of the destination, dst_o.
Effectively, MPYU2 op1, xop2, dst_o:dst_e performs the same operation as MPYU op1,
xop2, dst_e and MPYHU op1, xop2, dst_o together.
31          16 15           0
      C              A          ← op1
      D              B          ← xop2
      v              v
           A × B                ← dst_e
           C × D                ← dst_o
where (unsigned int) dst_e = (unsigned int)((unsigned short) A * (unsigned short) B);
and   (unsigned int) dst_o = (unsigned int)((unsigned short) C * (unsigned short) D);
Execution if (cond) {
ulsb16(op1) × ulsb16(xop2) → dst_e
umsb16(op1) × umsb16(xop2) → dst_o
}
else nop
Instruction Type
Delay Slots 3
A12 == 0x7fff7fff
A23 == 0x7fff7fff
MPYU2 .M A12,A23,A11:A10
A11 == 0x3fff0001
A10 == 0x3fff0001
A12 == 0x80017fff
A23 == 0x7fff8001
MPYU2 .M A12,A23,A11:A10
A11 == 0x3fffffff
A10 == 0x3fffffff
A12 == 0xc333c333
A23 == 0x3ccdc333
MPYU2 .M A12,A23,A11:A10
A11 == 0x2e5c43d7
A10 == 0x94d6bc29
A12 == 0x80008000
A23 == 0x80008000
MPYU2 .M A12,A23,A11:A10
A11 == 0x40000000
A10 == 0x40000000
A12 == 0x321089ab
A23 == 0x87654321
MPYU2 .M A12,A23,A11:A10
A11 == 0x1a7a3050
A10 == 0x2419800b
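The register values listed above can be checked with a small Python model of the packed multiply. This is only a sketch; the function name mpyu2 and the tuple return convention (dst_o, dst_e) are illustrative choices.

```python
def mpyu2(op1: int, xop2: int) -> tuple:
    """Model MPYU2: two unsigned 16 x 16 products, returned as (dst_o, dst_e)."""
    dst_e = (op1 & 0xFFFF) * (xop2 & 0xFFFF)                   # lower halfwords
    dst_o = ((op1 >> 16) & 0xFFFF) * ((xop2 >> 16) & 0xFFFF)   # upper halfwords
    return dst_o, dst_e


# Last listing above: A12 = 321089ABh, A23 = 87654321h.
a11, a10 = mpyu2(0x321089AB, 0x87654321)
```

Because both factors are at most FFFFh, each product fits in 32 bits and no masking of the result is needed.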
4.219 MPYU4
Multiply Unsigned × Unsigned, Four 8-Bit Pairs for Four 16-Bit Results
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 0 0 1 1 0 0 s p
5 1 1 1
Description Returns the product between four sets of packed 8-bit values producing four unsigned
16-bit results that are packed into a 64-bit register pair, dst_o:dst_e. The values in both
src1 and src2 are treated as unsigned 8-bit packed data.
For each pair of 8-bit quantities in src1 and src2, the unsigned 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2:
• The product of src1 byte 0 and src2 byte 0 is written to the lower half of dst_e.
• The product of src1 byte 1 and src2 byte 1 is written to the upper half of dst_e.
• The product of src1 byte 2 and src2 byte 2 is written to the lower half of dst_o.
• The product of src1 byte 3 and src2 byte 3 is written to the upper half of dst_o.
31       24 23       16 15        8 7         0
    ua_3        ua_2        ua_1        ua_0      ← src1
     ×           ×           ×           ×
    ub_3        ub_2        ub_1        ub_0      ← src2
                      MPYU4
63           48 47           32 31           16 15            0
  ua_3 × ub_3     ua_2 × ub_2     ua_1 × ub_1     ua_0 × ub_0   ← dst_o:dst_e
Execution if (cond) {
(ubyte0(src1) × ubyte0(src2)) → lsb16(dst_e);
(ubyte1(src1) × ubyte1(src2)) → msb16(dst_e);
(ubyte2(src1) × ubyte2(src2)) → lsb16(dst_o);
(ubyte3(src1) × ubyte3(src2)) → msb16(dst_o)
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
MPYU4 .M1 A5,A6,A9:A8
         Before instruction        After instruction
A9:A8    xxxx xxxxh xxxx xxxxh     47E8 16A8h 212C 6231h
                                   18408 5800 8492 25137 (unsigned)
Example 2
         Before instruction        After instruction
B9:B8    xxxx xxxxh xxxx xxxxh     2E77 4D44h 00A0 21BCh
                                   11895 19780 160 8636 (unsigned)
4.220 MPYUS
Multiply Unsigned 16 LSB × Signed 16 LSB
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 1 0 0 0 0 0 s p
5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution if (cond) lsb16(src1) × lsb16(src2) → dst
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
4.221 MPYUS4
Multiply Unsigned × Signed, Four 8-Bit Pairs for Four 16-Bit Results
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 0 1 1 1 0 0 s p
5 1 1 1
Description The MPYUS4 pseudo-operation returns the product between four sets of packed 8-bit
values, producing four signed 16-bit results. The four signed 16-bit results are packed
into a 64-bit register pair, dst_o:dst_e. The values in src1 are treated as signed 8-bit
packed quantities; whereas, the values in src2 are treated as unsigned 8-bit packed data.
The assembler uses the MPYSU4 (.unit)src1, src2, dst instruction to perform this
operation.
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2:
• The product of src1 byte 0 and src2 byte 0 is written to the lower half of dst_e.
• The product of src1 byte 1 and src2 byte 1 is written to the upper half of dst_e.
• The product of src1 byte 2 and src2 byte 2 is written to the lower half of dst_o.
• The product of src1 byte 3 and src2 byte 3 is written to the upper half of dst_o.
Execution if (cond) {
(sbyte0(src1) × ubyte0(src2)) → lsb16(dst_e);
(sbyte1(src1) × ubyte1(src2)) → msb16(dst_e);
(sbyte2(src1) × ubyte2(src2)) → lsb16(dst_o);
(sbyte3(src1) × ubyte3(src2)) → msb16(dst_o)
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
4.222 MV
Move From Register to Register
Opcode .L unit
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 5 4 3 2 1 0
0 0 0 0 op 1 1 0 s p
7 1 1
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 0 1 1 1 1 1 1 0 1 1 0 s p
5 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 1 1 0 1 0 0 0 s p
1 1 1
Opcode .D unit
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 1 0 0 1 0 1 0 0 0 0 s p
1 1
Description The MV pseudo-operation moves a value from one register to another. The assembler
will either use the ADD (.unit) 0, src2, dst or the OR (.unit) 0, src2, dst operation to
perform this task.
Delay Slots 0
4.223 MVC
Move Between Control File and Register File
unit = .S2
Opcode
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
crhi x 0 0 1 1 1 1 1 0 0 0 s p
5 1 1 1
Description The contents of the control file specified by the crhi and crlo fields are moved to the
register file specified by the dst field. Valid assembler values for crlo and crhi are shown
in Table 4-10.
Operands when moving from
the register file to the control
file:
31 29 28 27 23 22 18
creg z crlo src2
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
crhi x 0 0 1 1 1 0 1 0 0 0 s p
5 1 1 1
Description The contents of the register file specified by the src2 field are moved to the control file
specified by the crhi and crlo fields. Valid assembler values for crlo and crhi are shown
in Table 4-10.
else nop
Any write to the ISR or ICR (by the MVC instruction) effectively has one delay slot
because the results cannot be read (by the MVC instruction) in the IFR until two cycles
after the write to the ISR or ICR.
Delay Slots 0
B1 F0090001h B1 F0090001h
Note—The six MSBs of the AMR are reserved and therefore are not written to.
Table 4-10 contains the register addresses required to access the control registers.
Table 4-10 Register Addresses for Accessing the Control Registers
                                                                    Address       Supervisor    User
Acronym Register Name                                               crhi   crlo   Read/Write(1) Read/Write(1)
AMR     Addressing mode register                                    00000  00000  R, W          R, W
                                                                    0xxxx  00000
CSR     Control status register                                     00000  00001  R, W*         R, W*
                                                                    00001  00001
                                                                    0xxxx  00001
DNUM    DSP core number register                                    00000  10001  R             R
ECR     Exception clear register                                    00000  11101  W             X
EFR     Exception flag register                                     00000  11101  R             X
FADCR   Floating-point adder configuration register                 00000  10010  R, W          R, W
FAUCR   Floating-point auxiliary configuration register             00000  10011  R, W          R, W
FMCR    Floating-point multiplier configuration register            00000  10100  R, W          R, W
GFPGFR  Galois field multiply control register                      00000  11000  R, W          R, W
GPLYA   GMPY A-side polynomial register                             00000  10110  R, W          R, W
GPLYB   GMPY B-side polynomial register                             00000  10111  R, W          R, W
ICR     Interrupt clear register                                    00000  00011  W             X
                                                                    0xxxx  00011
IER     Interrupt enable register                                   00000  00100  R, W          X
                                                                    0xxxx  00100
IERR    Internal exception report register                          00000  11111  R, W          X
IFR     Interrupt flag register                                     00000  00010  R             X
                                                                    00010  00010
ILC     Inner loop count register                                   00000  01101  R, W          R, W
IRP     Interrupt return pointer register                           00000  00110  R, W          R, W
                                                                    0xxxx  00110
ISR     Interrupt set register                                      00000  00010  W             X
                                                                    0xxxx  00010
ISTP    Interrupt service table pointer register                    00000  00101  R, W          X
                                                                    0xxxx  00101
ITSR    Interrupt task state register                               00000  11011  R, W          X
NRP     Nonmaskable interrupt or exception return pointer register  00000  00111  R, W          R, W
                                                                    0xxxx  00111
NTSR    NMI/Exception task state register                           00000  11100  R, W          X
PCE1    Program counter, E1 phase                                   00000  10000  R             R
                                                                    10000  10000
REP     Restricted entry point address register                     00000  01111  R, W          X
RILC    Reload inner loop count register                            00000  01110  R, W          R, W
SSR     Saturation status register                                  00000  10101  R, W          R, W
TSCH    Time-stamp counter (high 32 bits) register                  00000  01011  R             R
TSCL    Time-stamp counter (low 32 bits) register                   00000  01010  R             R
TSR     Task state register                                         00000  11010  R, W*         R, W*
1. Legend: R = Readable by the MVC instruction; W = Writable by the MVC instruction; W* = Partially writable by the MVC
instruction; X = Access causes exception
4.224 MVD
Move From Register to Register, Delayed
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 0 x 0 0 0 0 1 1 1 1 0 0 s p
1 1 1
Description Moves data from the src2 register to the dst register over 4 cycles. This is done using the
multiplier path.
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2
Written dst
Unit in use .M
Delay Slots 3
4.225 MVK
Move Signed Constant Into Register and Sign Extend
Opcode .S unit
31 29 28 27 23
creg z dst
3 1 5
22 7 6 5 4 3 2 1 0
cst16 0 1 0 1 0 s p
16 1 1
Opcode .L unit
31 29 28 27 23 22 18 17 16
creg z dst cst5 0 0
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 1 x 0 0 1 1 0 1 0 1 1 0 s p
1 1 1
Opcode .D unit
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
cst5 0 0 0 0 0 0 1 0 0 0 0 s p
5 1 1
Description The constant cst is sign extended and placed in dst. The .S unit form allows for a 16-bit
signed constant.
Since many nonaddress constants fall into a 5-bit signed constant range, the MVK
instruction can also be scheduled on the .L or .D units. In the .D unit form,
the constant is in the position normally used by src1, as for address math.
In most cases, the C6000 assembler and linker issue a warning or an error when a
constant is outside the range supported by the instruction. In the case of MVK .S, a
warning is issued whenever the constant is outside the signed 16-bit range, -32768 to
32767 (or FFFF8000h to 00007FFFh).
For example, the constant in the following instruction (32768) lies outside the signed
16-bit range:
MVK .S1 0x00008000, A0
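The sign extension and the range check described above can be sketched in Python. This is an illustrative model, not the assembler's actual logic: the function names are hypothetical, and it assumes that an out-of-range constant is still encoded by keeping its 16 LSBs.

```python
def sign_extend(value: int, bits: int) -> int:
    """Sign extend the low `bits` bits of value to a 32-bit register image."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):          # sign bit of the field is set
        value -= 1 << bits
    return value & 0xFFFFFFFF


def mvk_s(cst: int) -> tuple:
    """Return (dst, in_range) for the 16-bit .S unit form of MVK."""
    in_range = -32768 <= cst <= 32767      # outside this range a warning is expected
    return sign_extend(cst, 16), in_range


dst, ok = mvk_s(0x00008000)   # 32768 is one past the largest signed 16-bit value
```

The same sign_extend helper with bits = 5 models the cst5 field used by the .L and .D unit forms.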
Pipeline
Pipeline Stage E1
Read
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Examples Example 1
Example 2
4.226 MVKH/MVKLH
Move 16-Bit Constant Into Upper Bits of Register
or
Opcode
31 29 28 27 23
creg z dst
3 1 5
22 7 6 5 4 3 2 1 0
cst16 h 1 0 1 0 s p
16 1 1 1
Description The 16-bit constant, cst16, is loaded into the upper 16 bits of dst. The 16 LSBs of dst are
unchanged. For the MVKH instruction, the assembler encodes the 16 MSBs of a 32-bit
constant into the cst16 field of the opcode. For the MVKLH instruction, the assembler
encodes the 16 LSBs of a constant into the cst16 field of the opcode.
Note—Use the MVK instruction (see MVK) to load 16-bit constants. The
assembler generates a warning for any constant over 16 bits. To load 32-bit
constants, such as 1234 5678h, use the following pair of instructions:
MVKL 0x12345678
MVKH 0x12345678
If you are loading the address of a label, use:
MVKL label
MVKH label
Execution For the MVKLH instruction:
else nop
else nop
Pipeline
Pipeline Stage E1
Read
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A1 00007634h A1 0A327634h
Example 2
A1 FFFFF25Ah A1 07A8F25Ah
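The effect of the two instructions on dst can be modeled in Python as follows. The function names are illustrative, and the constants passed in are assumptions chosen so that the results match the A1 values in Examples 1 and 2.

```python
def mvkh(cst: int, dst: int) -> int:
    """MVKH: the 16 MSBs of a 32-bit constant replace the upper half of dst."""
    return (cst & 0xFFFF0000) | (dst & 0x0000FFFF)


def mvklh(cst: int, dst: int) -> int:
    """MVKLH: the 16 LSBs of a constant replace the upper half of dst."""
    return ((cst & 0xFFFF) << 16) | (dst & 0x0000FFFF)


a1_example1 = mvkh(0x0A320000, 0x00007634)    # hypothetical constant
a1_example2 = mvklh(0x000007A8, 0xFFFFF25A)   # hypothetical constant
```

In both cases the 16 LSBs of dst pass through unchanged, which is what allows these instructions to follow an MVKL that has already loaded the lower half.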
4.227 MVKL
Move Signed Constant Into Register and Sign Extend
Opcode
31 29 28 27 23
creg z dst
3 1 5
22 7 6 5 4 3 2 1 0
cst16 0 1 0 1 0 s p
16 1 1
Description The 16-bit constant, cst16, is sign extended and placed in dst.
The MVKL instruction is equivalent to the MVK instruction (see MVK), except that
the MVKL instruction disables the constant range checking normally performed by the
assembler/linker. This allows the MVKL instruction to be paired with the MVKH
instruction (see MVKH/MVKLH) to generate 32-bit constants.
To load 32-bit constants, such as 1234ABCDh, use the following pair of instructions:
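As a rough Python model of what an MVKL/MVKH pair does with the constant 1234ABCDh: MVKL sign extends the 16 LSBs, then MVKH overwrites the upper half. The function names are hypothetical, and the MVKH step is written inline from its description rather than taken from any TI tool.

```python
def mvkl(cst: int) -> int:
    """MVKL: sign extend the 16 LSBs of cst, with no range check."""
    low = cst & 0xFFFF
    if low & 0x8000:
        low |= 0xFFFF0000
    return low


def load_const32(cst: int) -> int:
    """MVKL followed by MVKH on the same destination register."""
    dst = mvkl(cst)                                    # lower half, sign extended
    dst = (cst & 0xFFFF0000) | (dst & 0x0000FFFF)      # MVKH replaces the upper half
    return dst


intermediate = mvkl(0x1234ABCD)       # upper half is all ones after sign extension
final = load_const32(0x1234ABCD)
```

The intermediate value shows why the pair is needed: after MVKL alone, the upper half holds sign bits rather than 1234h.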
else nop
Pipeline
Pipeline Stage E1
Read
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
Example 2
4.228 NEG
Negate
or
Opcode .S unit
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 1 0 1 1 0 1 0 0 0 s p
1 1 1
Opcode .L unit
31 29 28 27 23 22 18 17 16
15 14 13 12 11 5 4 3 2 1 0
0 0 0 x op 1 1 0 s p
1 7 1 1
Description The NEG pseudo-operation negates src2 and places the result in dst. The assembler
uses SUB (.unit) 0, src2, dst to perform this operation.
Execution if (cond) 0 -s src2 → dst
else nop
Instruction Type Single-cycle
Delay Slots 0
4.229 NOP
No Operation
unit = none
Opcode
31 18 17
Reserved 0
14
16 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src 0 0 0 0 0 0 0 0 0 0 0 0 p
4 1
Description src is encoded as count - 1. For src + 1 cycles, no operation is performed. The maximum
value for count is 9. NOP with no operand is treated like NOP 1 with src encoded as
0000.
A multicycle NOP will not finish if a branch is completed first. For example, if a branch
is initiated on cycle n and a NOP 5 instruction is initiated on cycle n + 3, the branch is
complete on cycle n + 6 and the NOP is executed only from cycle n + 3 to cycle n + 5.
A single-cycle NOP in parallel with other instructions does not affect operation.
A multicycle NOP instruction cannot be paired with any other multicycle NOP
instruction in the same execute packet. Instructions that generate a multicycle NOP
are: ADDKPC, BNOP, CALLP, and IDLE.
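The src encoding and the branch cutoff described above can be illustrated with a short Python sketch. Both helper names are hypothetical, and the cutoff model is an assumption based only on the worked example in the text (branch complete on cycle n + 6 ends a NOP 5 started on cycle n + 3).

```python
def encode_nop_src(count: int = 1) -> int:
    """Return the 4-bit src field for NOP count (count defaults to 1)."""
    if not 1 <= count <= 9:
        raise ValueError("NOP count must be between 1 and 9")
    return count - 1


def nop_cycles_executed(nop_start: int, count: int, branch_done=None) -> int:
    """Cycles a multicycle NOP actually occupies when a branch may complete first."""
    end = nop_start + count                     # first cycle after a full NOP
    if branch_done is not None and branch_done < end:
        end = max(branch_done, nop_start)       # branch completion cuts the NOP short
    return end - nop_start


# Text example with n = 0: NOP 5 starts on cycle 3, branch completes on cycle 6.
cycles = nop_cycles_executed(nop_start=3, count=5, branch_done=6)
```

With these inputs the NOP covers cycles 3 through 5 only, matching the n + 3 to n + 5 window given in the description.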
Delay Slots 0
Examples Example 1
Example 2
NOP 5
A1 00000001h A1 00000004h
A2 00000003h A2 00000003h
4.230 NORM
Normalize Integer
or
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 5 4 3 2 1 0
0 0 0 x op 1 1 0 s p
1 7 1 1
Description The number of redundant sign bits of src2 is placed in dst. Several examples are shown
in the following diagram.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 1 x x x x x x x x x x x x x x
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
x x x x x x x x x x x x x x x x
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 0 1 x x x x x x x x x x x
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
x x x x x x x x x x x x x x x x
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
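Counting redundant sign bits can be sketched in Python as shown here. The function name norm is illustrative; the model counts how many bits below bit 31 repeat the sign bit before the first differing bit.

```python
def norm(src2: int) -> int:
    """Return the number of redundant sign bits in a 32-bit value."""
    x = src2 & 0xFFFFFFFF
    sign = x >> 31                       # bit 31 is the sign bit
    count = 0
    for bit in range(30, -1, -1):        # walk down from bit 30
        if ((x >> bit) & 1) != sign:
            break                        # first bit that differs from the sign
        count += 1
    return count


# The four bit patterns in the diagram, with every "x" bit taken as 0.
diagram_results = [norm(0x40000000), norm(0x08000000),
                   norm(0xFFFFFFFE), norm(0xFFFFFFFF)]
```

For the four diagram patterns this model yields 0, 3, 30, and 31, and the "x" bits do not affect the count because they lie below the first bit that differs from the sign.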
Examples Example 1
A1 02A3469Fh A1 02A3469Fh
A2 xxxxxxxxh A2 00000005h 5
Example 2
A1 FFFFF25Ah A1 FFFFF25Ah
A2 xxxxxxxxh A2 00000013h 19
Example 3
4.231 NOT
Bitwise NOT
Opcode .L unit
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 x 1 1 0 1 1 1 0 1 1 0 s p
1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 16
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 x 0 0 1 0 1 0 1 0 0 0 s p
1 1 1
Description The NOT pseudo-operation performs a bitwise NOT on the src2 operand and places
the result in dst. The assembler uses XOR (.unit) -1, src2, dst to perform this operation.
Execution if (cond) -1 XOR src2 → dst
else nop
Delay Slots 0
4.232 OR
Bitwise OR
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
3 5 5 5 6
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
5 5 5 6
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x opcode 1 1 0 s p
3 5 5 5 7
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31 30 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
5 5 5 6
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
3 5 5 5 4
Description The OR instruction performs a bitwise OR of the two source operands and stores
the result in the destination register.
See Also
Example A0 == 0xfadedb0a
OR .S 7,A0,A15
A15 == 0xfadedb0f
A0 == 0xbeef0000
A1 == 0x0000babe
OR .S A0,A1,A2
A2 == 0xbeefbabe
A1 == 0xbeef0000
A0 == 0xbeef0000
A3 == 0x0000babe
A2 == 0xffffbabe
OR .S A1:A0,A3:A2,A9:A8
A9 == 0xbeefbabe
A8 == 0xffffbabe
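The three listings above can be reproduced with a short Python sketch. The helper names are illustrative, and the register-pair form is modeled as two independent 32-bit ORs, which is an assumption consistent with the A9:A8 result shown.

```python
def or32(src1: int, src2: int) -> int:
    """Bitwise OR of two 32-bit operands (src1 may be a small constant)."""
    return (src1 | src2) & 0xFFFFFFFF


def or_pair(src1_pair: tuple, src2_pair: tuple) -> tuple:
    """OR of two register pairs, each given as (odd, even)."""
    (odd1, even1), (odd2, even2) = src1_pair, src2_pair
    return or32(odd1, odd2), or32(even1, even2)


a15 = or32(7, 0xFADEDB0A)
a9, a8 = or_pair((0xBEEF0000, 0xBEEF0000), (0x0000BABE, 0xFFFFBABE))
```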
4.233 PACK2
Pack Two 16 LSBs Into Upper and Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 0 0 0 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 1 1 1 1 1 0 0 s p
5 1 1 1
Description Moves the lower halfwords from src1 and src2 and packs them both into dst. The lower
halfword of src1 is placed in the upper halfword of dst. The lower halfword of src2 is
placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2 (see ADD2).
31       16 15        0
   a_hi        a_lo      ← src1
   b_hi        b_lo      ← src2
         PACK2
31       16 15        0
   a_lo        b_lo      ← dst
Execution if (cond) {
lsb16(src2) → lsb16(dst);
lsb16(src1) → msb16(dst)
}
else nop
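The Execution block above amounts to a mask, a shift, and an OR. A minimal Python sketch, with an illustrative function name and made-up input values:

```python
def pack2(src1: int, src2: int) -> int:
    """PACK2: lsb16(src1) goes to msb16(dst), lsb16(src2) goes to lsb16(dst)."""
    return ((src1 & 0xFFFF) << 16) | (src2 & 0xFFFF)


# Illustrative inputs: the upper halfwords 1234h and 9ABCh are discarded.
dst = pack2(0x12345678, 0x9ABCDEF0)
```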
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
Example 2
4.234 PACKH2
Pack Two 16 MSBs Into Upper and Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 1 1 0 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 0 0 1 1 0 0 0 s p
5 1 1 1
Description Moves the upper halfwords from src1 and src2 and packs them both into dst. The upper
halfword of src1 is placed in the upper halfword of dst. The upper halfword of src2 is
placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2 (see ADD2).
31       16 15        0
   a_hi        a_lo      ← src1
   b_hi        b_lo      ← src2
        PACKH2
31       16 15        0
   a_hi        b_hi      ← dst
Execution if (cond) {
msb16(src2) → lsb16(dst);
msb16(src1) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
Example 2
4.235 PACKH4
Pack Four High Bytes Into Four 8-Bit Halfwords
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 1 0 0 1 1 1 0 s p
5 1 1 1
Description Moves the high bytes of the two halfwords in src1 and src2, and packs them into dst.
The bytes from src1 are packed into the most-significant bytes of dst, and the bytes from
src2 are packed into the least-significant bytes of dst.
• The high byte of the upper halfword of src1 is moved to the upper byte of the
upper halfword of dst. The high byte of the lower halfword of src1 is moved to the
lower byte of the upper halfword of dst.
• The high byte of the upper halfword of src2 is moved to the upper byte of the
lower halfword of dst. The high byte of the lower halfword of src2 is moved to the
lower byte of the lower halfword of dst.
31    24 23    16 15     8 7      0
   a_3      a_2      a_1      a_0    ← src1
   b_3      b_2      b_1      b_0    ← src2
              PACKH4
31    24 23    16 15     8 7      0
   a_3      a_1      b_3      b_1    ← dst
Execution if (cond) {
byte3(src1) → byte3(dst);
byte1(src1) → byte2(dst);
byte3(src2) → byte1(dst);
byte1(src2) → byte0(dst)
}
else nop
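The four byte moves in the Execution block can be sketched in Python as follows. The function name is illustrative, and the inputs are the A2 and A8 values listed under Example 1, assuming A2 is src1 and A8 is src2.

```python
def packh4(src1: int, src2: int) -> int:
    """PACKH4: gather bytes 3 and 1 of src1, then bytes 3 and 1 of src2."""
    def byte(value: int, n: int) -> int:
        return (value >> (8 * n)) & 0xFF

    return ((byte(src1, 3) << 24) | (byte(src1, 1) << 16) |
            (byte(src2, 3) << 8) | byte(src2, 1))


# Assumed operand roles: src1 = A2 = 3789F23Ah, src2 = A8 = 04B84975h.
dst = packh4(0x3789F23A, 0x04B84975)
```

Bytes 2 and 0 of each source (the low bytes of the two halfwords) are dropped.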
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
A2 37 89 F2 3Ah A2 37 89 F2 3Ah
A8 04 B8 49 75h A8 04 B8 49 75h
Example 2
B2 01 24 24 51h B2 01 24 24 51h
B8 01 A6 A0 51h B8 01 A6 A0 51h
4.236 PACKHL2
Pack 16 MSB Into Upper and 16 LSB Into Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 1 0 0 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 0 0 0 1 0 0 0 s p
5 1 1 1
Description Moves the upper halfword from src1 and the lower halfword from src2 and packs them
both into dst. The upper halfword of src1 is placed in the upper halfword of dst. The
lower halfword of src2 is placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2 (see ADD2).
31       16 15        0
   a_hi        a_lo      ← src1
   b_hi        b_lo      ← src2
        PACKHL2
31       16 15        0
   a_hi        b_lo      ← dst
Execution if (cond) {
lsb16(src2) → lsb16(dst);
msb16(src1) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
Example 2
4.237 PACKLH2
Pack 16 LSB Into Upper and 16 MSB Into Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 0 1 1 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 0 0 1 0 0 0 s p
5 1 1 1
Description Moves the lower halfword from src1, and the upper halfword from src2, and packs them
both into dst. The lower halfword of src1 is placed in the upper halfword of dst. The
upper halfword of src2 is placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be
used by the packed arithmetic operations, such as ADD2 (see ADD2).
31 16 15 0
a_hi a_lo ←src1
PACKLH2
31 16 15 0
a_lo b_hi ←dst
Execution if (cond) {
msb16(src2) → lsb16(dst);
lsb16(src1) → msb16(dst)
}
else nop
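A minimal C sketch of the same data movement follows. The name packlh2 is a hypothetical helper, not a documented intrinsic, and 32-bit register values are assumed to be held in uint32_t.

```c
#include <stdint.h>

/* Hypothetical model of PACKLH2: the lower halfword of src1 becomes the
 * upper halfword of dst, the upper halfword of src2 the lower halfword. */
static uint32_t packlh2(uint32_t src1, uint32_t src2)
{
    return (src1 << 16) | (src2 >> 16);
}
```

In this model, packlh2(0x12345678, 0x9ABCDEF0) yields 0x56789ABC.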
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
Example 2
4.238 PACKL4
Pack Four Low Bytes Into Four 8-Bit Halfwords
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 1 0 0 0 1 1 0 s p
5 1 1 1
Description Moves the low bytes of the two halfwords in src1 and src2, and packs them into dst. The
bytes from src1 are packed into the most-significant bytes of dst, and the bytes from src2
are packed into the least-significant bytes of dst.
• The low byte of the upper halfword of src1 is moved to the upper byte of the upper
halfword of dst. The low byte of the lower halfword of src1 is moved to the lower
byte of the upper halfword of dst.
• The low byte of the upper halfword of src2 is moved to the upper byte of the lower
halfword of dst. The low byte of the lower halfword of src2 is moved to the lower
byte of the lower halfword of dst.
31 24 23 16 15 8 7 0
a_3 a_2 a_1 a_0 ←src1
PACKL4
31 24 23 16 15 8 7 0
a_2 a_0 b_2 b_0 ←dst
Execution if (cond) {
byte2(src1) → byte3(dst);
byte0(src1) → byte2(dst);
byte2(src2) → byte1(dst);
byte0(src2) → byte0(dst)
}
else nop
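The four byte moves in the Execution listing can be sketched in C as below. The helper name packl4 is an assumption made for illustration; byte n denotes bits 8n+7..8n of a 32-bit value.

```c
#include <stdint.h>

/* Hypothetical model of PACKL4: keep the low byte of each halfword. */
static uint32_t packl4(uint32_t src1, uint32_t src2)
{
    return ((src1 & 0x00FF0000u) << 8)   /* byte2(src1) -> byte3(dst) */
         | ((src1 & 0x000000FFu) << 16)  /* byte0(src1) -> byte2(dst) */
         | ((src2 & 0x00FF0000u) >> 8)   /* byte2(src2) -> byte1(dst) */
         |  (src2 & 0x000000FFu);        /* byte0(src2) -> byte0(dst) */
}
```

In this model, packl4(0x11223344, 0x55667788) yields 0x22446688.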
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
A2 37 89 F2 3Ah A2 37 89 F2 3Ah
A8 04 B8 49 75h A8 04 B8 49 75h
Example 2
B2 01 24 24 51h B2 01 24 24 51h
B8 01 A6 A0 51h B8 01 A6 A0 51h
4.239 QMPY32
4-Way SIMD Multiply, Packed Signed 32-bit
31 30 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
5 5 5 5
Description This is a 4x SIMD 32 by 32 multiplier operation where the output is the lower 32 bits of
the result.
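A rough C model of this behaviour is sketched below. The function name qmpy32 and the array layout (one array element per 32-bit lane) are assumptions for illustration; the real instruction operates on register quads.

```c
#include <stdint.h>

/* Hypothetical model of QMPY32: four independent 32x32 multiplies,
 * each keeping only the lower 32 bits of its product. */
static void qmpy32(const int32_t a[4], const int32_t b[4], int32_t dst[4])
{
    for (int i = 0; i < 4; ++i) {
        /* Multiply as unsigned to avoid signed-overflow undefined behaviour;
         * the low 32 bits are the same for signed and unsigned products.
         * The conversion back to int32_t assumes two's-complement wrapping. */
        dst[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}
```

Because only the low word is kept, a lane such as 0x7FFFFFFF * 2 wraps to 0xFFFFFFFE (that is, -2) in this model.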
Delay Slots 3
See Also
Example A3 == 0x80000000
A2 == 0x80000000
A1 == 0x7FFFFFFF
A0 == 0xFFFFFFFF
A11 == 0xFFFFFFF
A10 == 0x8000000
A9 == 0x7FFFFFF
A8 == 0xFFFFFFF
QMPY32 .M .....
A15 == 0x8000000
A14 == 0x0000000
A13 == 0x0000001
A12 == 0x0000001
4.240 QMPYSP
4-Way SIMD Floating Point Multiply, Packed Single-Precision Floating Point
31 30 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
5 5 5 5
Description The QMPYSP instruction performs four floating-point multiplies of the four pairs of
single-precision floating-point values contained in qwop1 and qwop2. This instruction
cannot use the crosspath.
Special Cases:
1. If one source is SNaN or QNaN, the result is a signed NaN_out and the NANn bit
is set. If either source is SNaN, the INVAL bit is also set. The sign of NaN_out is
the XOR of the input signs.
2. Signed infinity multiplied by signed infinity or a normalized number (other than
signed zero) returns signed infinity. Signed infinity multiplied by signed zero (or
denormal) returns a signed NaN_out and sets the INVAL bit.
3. If one or both sources are signed zero, the result is signed zero unless the other
source is a NaN or signed infinity, in which case the result is signed NaN_out.
4. If signed zero is multiplied by signed infinity, the result is signed NaN_out and
the INVAL bit is set.
5. A denormalized source is treated as signed zero and the DENn bit is set. The
INEX bit is set except when the other source is signed infinity, signed NaN, or
signed zero. Therefore, a signed infinity multiplied by a denormalized number
gives a signed NaN_out and sets the INVAL bit.
6. If rounding is performed, the INEX bit is set.
Delay Slots 3
Example A3 == 0x80000000
A2 == 0x80000000
A1 == 0x7FFFFFFF
A0 == 0xFFFFFFFF
A11 == 0xFFFFFFF
A10 == 0x8000000
A9 == 0x7FFFFFF
A8 == 0xFFFFFFF
QMPY32 .M .....
A15 == 0x8000000
A14 == 0x0000000
A13 == 0x0000001
A12 == 0x0000001
4.241 QSMPY32R1
4-Way SIMD Multiply with Saturation and Rounding, Packed Signed 32-bit
31 30 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
5 5 5 5
Description This is a 4x SIMD fractional 32 by 32 multiplier operation where the output is kept at
32 bits precision. This instruction is the same as MPY32 except the result is shifted by
31 bits to the right and rounded. This normalizes the result to lie within -1 and 1 in a
Q31 fractional number system. The case where both inputs are the maximum negative
value requires saturation; otherwise the result would overflow.
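The behaviour of one lane can be sketched in C as follows. This is a model inferred from the description and from the worked examples in this section (for instance 1 * 2^30 gives 1, while 1 * (2^30 - 1) gives 0), which are consistent with adding 2^30 before the 31-bit shift; that rounding constant, the helper name smpy32r1_lane, and the sat flag parameter standing in for the SAT bit are all assumptions.

```c
#include <stdint.h>

/* Hypothetical single-lane model: Q31 x Q31 -> Q31 with rounding and
 * saturation. *sat is set to 1 when the result saturates. */
static int32_t smpy32r1_lane(int32_t a, int32_t b, int *sat)
{
    int64_t p = (int64_t)a * (int64_t)b;   /* exact 64-bit product */
    /* Assumed rounding: add half an LSB (2^30), then shift right by 31.
     * Right-shifting a negative value is taken to be arithmetic here. */
    int64_t r = (p + ((int64_t)1 << 30)) >> 31;
    if (r > INT32_MAX) {                   /* only (-2^31) * (-2^31) */
        r = INT32_MAX;
        *sat = 1;
    }
    return (int32_t)r;
}
```

The four-way instruction would apply the same computation to each of the four 32-bit lanes independently.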
Delay Slots 3
See Also
Example CSR == 0x00000000 ; 4
A7 == 0x00000001
A6 == 0x00000001
A5 == 0x00000001
A4 == 0x00000001
A11 == 0x00000001
A10 == 0x00000001
A9 == 0x00000001
A8 == 0x00000001
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x00000000
A2 == 0x00000000
A1 == 0x00000000
A0 == 0x00000000
CSR= 0x00000000 ; 4
;[1]*[1]->[0];
CSR == 0x00000000 ; 4
A7 == 0x00000001
A6 == 0x00000001
A5 == 0x00000001
A4 == 0x00000001
A11 == 0x3FFFFFFF
A10 == 0x3FFFFFFF
A9 == 0x3FFFFFFF
A8 == 0x3FFFFFFF
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x00000000
A2 == 0x00000000
A1 == 0x00000000
A0 == 0x00000000
CSR= 0x00000000 ; 4
;[1]*[2^30-1]->[0];
CSR == 0x00000000 ; 4
A7 == 0x00000001
A6 == 0x00000001
A5 == 0x00000001
A4 == 0x00000001
A11 == 0x40000000
A10 == 0x40000000
A9 == 0x40000000
A8 == 0x40000000
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x00000001
A2 == 0x00000001
A1 == 0x00000001
A0 == 0x00000001
CSR= 0x00000000 ; 4
;[1]*[2^30]->[1];
CSR == 0x00000000 ; 4
A7 == 0x19680828
A6 == 0x19680828
A5 == 0x19680828
A4 == 0x19680828
A11 == 0x19700520
A10 == 0x19700520
A9 == 0x19700520
A8 == 0x19700520
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x050C8DA3
A2 == 0x050C8DA3
A1 == 0x050C8DA3
A0 == 0x050C8DA3
CSR= 0x00000000 ; 4
;[426248232]*[426771744]->[84708771];
CSR == 0x00000000 ; 4
A7 == 0x19680828
A6 == 0x19680828
A5 == 0x19680828
A4 == 0x19680828
A11 == 0xE68FFAE0
A10 == 0xE68FFAE0
A9 == 0xE68FFAE0
A8 == 0xE68FFAE0
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0xFAF3725D
A2 == 0xFAF3725D
A1 == 0xFAF3725D
A0 == 0xFAF3725D
CSR= 0x00000000 ; 4
;[426248232]*[-426771744]->[-84708771];
CSR == 0x00000000 ; 4
A7 == 0xE697F7D8
A6 == 0xE697F7D8
A5 == 0xE697F7D8
A4 == 0xE697F7D8
A11 == 0xE68FFAE0
A10 == 0xE68FFAE0
A9 == 0xE68FFAE0
A8 == 0xE68FFAE0
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x050C8DA3
A2 == 0x050C8DA3
A1 == 0x050C8DA3
A0 == 0x050C8DA3
CSR= 0x00000000 ; 4
;[-426248232]*[-426771744]->[84708771];
CSR == 0x00000000 ; 4
A7 == 0x80000000
A6 == 0x80000000
A5 == 0x80000000
A4 == 0x80000000
A11 == 0x80000000
A10 == 0x80000000
A9 == 0x80000000
A8 == 0x80000000
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x7FFFFFFF
A2 == 0x7FFFFFFF
A1 == 0x7FFFFFFF
A0 == 0x7FFFFFFF
CSR= 0x00000200 ; 4
;[-2^31]*[-2^31]=[2^31-1];
CSR == 0x00000000 ; 4
A7 == 0x80000000
A6 == 0x80000000
A5 == 0x80000000
A4 == 0x80000000
A11 == 0x80000001
A10 == 0x80000001
A9 == 0x80000001
A8 == 0x80000001
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x7FFFFFFF
A2 == 0x7FFFFFFF
A1 == 0x7FFFFFFF
A0 == 0x7FFFFFFF
CSR= 0x00000000 ; 4
;[-2^31]*[-2^31+1]=[2^31-1];
CSR == 0x00000000 ; 4
A7 == 0x80000001
A6 == 0x80000001
A5 == 0x80000001
A4 == 0x80000001
A11 == 0x80000001
A10 == 0x80000001
A9 == 0x80000001
A8 == 0x80000001
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x7FFFFFFE
A2 == 0x7FFFFFFE
A1 == 0x7FFFFFFE
A0 == 0x7FFFFFFE
CSR= 0x00000000 ; 4
;[-2^31+1]*[-2^31+1]=[2^31-2];
CSR == 0x00000000 ; 4
A7 == 0x80000000
A6 == 0x80000000
A5 == 0x80000000
A4 == 0x80000000
A11 == 0x00000000
A10 == 0x00000000
A9 == 0x00000000
A8 == 0x00000000
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x00000000
A2 == 0x00000000
A1 == 0x00000000
A0 == 0x00000000
CSR= 0x00000000 ; 4
;[-2^31]*[0]=[0];
CSR == 0x00000000 ; 4
B7 == 0x80000000
B6 == 0x80000000
B5 == 0x80000000
B4 == 0x80000000
B11 == 0x80000000
B10 == 0x80000000
B9 == 0x80000000
B8 == 0x80000000
QSMPY32R1 .M B7:B6:B5:B4,B11:B10:B9:B8,B3:B2:B1:B0
B3 == 0x7FFFFFFF
B2 == 0x7FFFFFFF
B1 == 0x7FFFFFFF
B0 == 0x7FFFFFFF
CSR= 0x00000200 ; 4
;[-2^31]*[-2^31]=[2^31-1];
A7 == 0x80000000
A6 == 0x80000000
A5 == 0x80000000
A4 == 0x80000000
A11 == 0xffffffff
A10 == 0xffffffff
A9 == 0xffffffff
A8 == 0xffffffff
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x00000001
A2 == 0x00000001
A1 == 0x00000001
A0 == 0x00000001
A7 == 0xffffffff
A6 == 0xffffffff
A5 == 0xffffffff
A4 == 0xffffffff
A11 == 0x80000000
A10 == 0x80000000
A9 == 0x80000000
A8 == 0x80000000
QSMPY32R1 .M A7:A6:A5:A4,A11:A10:A9:A8,A3:A2:A1:A0
A3 == 0x00000001
A2 == 0x00000001
A1 == 0x00000001
A0 == 0x00000001
B7 == 0x80000000
B6 == 0x80000000
B5 == 0x80000000
B4 == 0x80000000
B11 == 0xffffffff
B10 == 0xffffffff
B9 == 0xffffffff
B8 == 0xffffffff
QSMPY32R1 .M B7:B6:B5:B4,B11:B10:B9:B8,B3:B2:B1:B0
B3 == 0x00000001
B2 == 0x00000001
B1 == 0x00000001
B0 == 0x00000001
B7 == 0xffffffff
B6 == 0xffffffff
B5 == 0xffffffff
B4 == 0xffffffff
B11 == 0x80000000
B10 == 0x80000000
B9 == 0x80000000
B8 == 0x80000000
QSMPY32R1 .M B7:B6:B5:B4,B11:B10:B9:B8,B3:B2:B1:B0
B3 == 0x00000001
B2 == 0x00000001
B1 == 0x00000001
B0 == 0x00000001
4.242 RCPDP
Double-Precision Floating-Point Reciprocal Approximation
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Reserved x 1 0 1 1 0 1 1 0 0 0 s p
5 1 1 1
The RCPDP instruction provides the correct exponent, and the mantissa is accurate to
the eighth binary position (therefore, mantissa error is less than 2^-8). This estimate can
be used as a seed value for an algorithm to compute the reciprocal to greater accuracy.
x[0], the seed value for the algorithm, is given by RCPDP. For each iteration, the
accuracy doubles. Thus, with one iteration, accuracy is 16 bits in the mantissa; with the
second iteration, the accuracy is 32 bits; with the third iteration, the accuracy is the full
52 bits.
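The refinement loop can be sketched in C. The sketch assumes the usual Newton-Raphson step for a reciprocal, x[n+1] = x[n] * (2 - v * x[n]); the function name refine_recip is hypothetical, and the seed argument stands in for the value an RCPDP instruction would produce.

```c
/* Hypothetical refinement of a reciprocal seed for 1/v.
 * Each pass roughly doubles the number of accurate mantissa bits. */
static double refine_recip(double v, double seed, int iterations)
{
    double x = seed;                 /* x[0]: stand-in for the RCPDP result */
    for (int i = 0; i < iterations; ++i) {
        x = x * (2.0 - v * x);       /* assumed Newton-Raphson step */
    }
    return x;
}
```

With a seed accurate to about 8 bits, three calls of the step (iterations = 3) correspond to the 16, 32, and 52-bit stages described above.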
Note—
1) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
2) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3) If src2 is a signed denormalized number, signed infinity is placed in dst and the
DIV0, INFO, OVER, INEX, and DEN2 bits are set.
4) If src2 is signed 0, signed infinity is placed in dst and the DIV0 and INFO bits
are set.
5) If src2 is signed infinity, signed 0 is placed in dst.
6) If the result underflows, signed 0 is placed in dst and the INEX and UNDER
bits are set. Underflow occurs when 2^1022 < src2 < infinity.
Execution if (cond) rcp(src2) → dst
else nop
Pipeline
Pipeline Stage E1 E2
Read src2_l, src2_h
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 4010 0000h 0000 0000h A1:A0 4010 0000h 0000 0000h 4.00
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 3FD0 0000h 0000 0000h 0.25
4.243 RCPSP
Single-Precision Floating-Point Reciprocal Approximation
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 1 1 1 1 0 1 1 0 0 0 s p
1 1 1
The RCPSP instruction provides the correct exponent, and the mantissa is accurate to
the eighth binary position (therefore, mantissa error is less than 2^-8). This estimate can
be used as a seed value for an algorithm to compute the reciprocal to greater accuracy.
x[0], the seed value for the algorithm, is given by RCPSP. For each iteration, the
accuracy doubles. Thus, with one iteration, accuracy is 16 bits in the mantissa; with the
second iteration, the accuracy is the full 23 bits.
Note—
1) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
2) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3) If src2 is a signed denormalized number, signed infinity is placed in dst and the
DIV0, INFO, OVER, INEX, and DEN2 bits are set.
4) If src2 is signed 0, signed infinity is placed in dst and the DIV0 and INFO bits
are set.
5) If src2 is signed infinity, signed 0 is placed in dst.
6) If the result underflows, signed 0 is placed in dst and the INEX and UNDER
bits are set. Underflow occurs when 2^126 < src2 < infinity.
Execution if (cond) rcp(src2) → dst
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .S
Delay Slots 0
4.244 RINT
Restore Previous Enable State
Syntax RINT
unit = none
Compatibility
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description Copies the contents of the SGIE bit in TSR into the GIE bit in TSR and CSR, and clears
the SGIE bit in TSR. The value of the SGIE bit in TSR is used for the current cycle as
the GIE indication; if restoring the GIE bit to 1, interrupts are enabled and can be taken
after the E1 phase containing the RINT instruction.
The CPU may service a maskable interrupt in the cycle immediately following the
RINT instruction. See section 5.2 for details.
The RINT instruction cannot be placed in parallel with: MVC reg, TSR; MVC reg,
CSR; B IRP; B NRP; NOP n; DINT; SPKERNEL; SPKERNELR; SPLOOP;
SPLOOPD; SPLOOPW; SPMASK; or SPMASKR.
Note—The use of the DINT and RINT instructions in a nested manner, like the
following code:
DINT
DINT
RINT
RINT
leaves interrupts disabled. The first DINT leaves TSR.GIE cleared to 0, so the
second DINT leaves TSR.SGIE cleared to 0. The RINT instructions, therefore,
copy zero to TSR.GIE (leaving interrupts disabled).
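The nesting behaviour in the note can be traced with a small C model. The struct and function names are hypothetical. RINT follows the Description above; the DINT behaviour (copy GIE to SGIE, then clear GIE) is an assumption inferred from the note.

```c
/* Hypothetical two-bit model of the TSR interrupt-enable state. */
struct tsr_model {
    unsigned gie;   /* global interrupt enable */
    unsigned sgie;  /* saved global interrupt enable */
};

/* Assumed DINT behaviour: save GIE in SGIE, then disable interrupts. */
static void model_dint(struct tsr_model *t)
{
    t->sgie = t->gie;
    t->gie = 0;
}

/* RINT as described: restore GIE from SGIE, then clear SGIE. */
static void model_rint(struct tsr_model *t)
{
    t->gie = t->sgie;
    t->sgie = 0;
}
```

Starting from gie = 1, the sequence DINT, DINT, RINT, RINT ends with gie = 0 in this model, whereas a single DINT followed by a single RINT restores gie = 1.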
Execution Enable interrupts in current cycle
Delay Slots 0
4.245 ROTL
Rotate Left
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 6 5 4 3 2 1 0
src1 x 0 op 1 1 0 0 s p
5 1 5 1 1
Description Rotates the 32-bit value of src2 to the left, and places the result in dst. The number of
bits to rotate is given in the 5 least-significant bits of src1. Bits 5 through 31 of src1 are
ignored and may be non-zero.
31 24 23 16 15 8 7 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ←src2
ROTL
31 0
ijklmnopqrstuvwxyzABCDEFabcdefgh ←dst
(for src1 = 8)
Execution if (cond) (src2 << src1) | (src2 >> (32 - src1)) → dst
else nop
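A portable C rendering of the Execution expression needs one extra precaution: in C, shifting a 32-bit value by 32 is undefined, so a rotate count of 0 must not produce src2 >> 32. The sketch below, with the hypothetical name rotl32, masks both shift counts to avoid that case.

```c
#include <stdint.h>

/* Hypothetical model of ROTL: rotate src2 left by the 5 LSBs of src1. */
static uint32_t rotl32(uint32_t src2, uint32_t src1)
{
    uint32_t n = src1 & 31u;                 /* bits 5..31 of src1 ignored */
    /* (32 - n) & 31 maps n == 0 to a shift of 0 instead of 32. */
    return (src2 << n) | (src2 >> ((32u - n) & 31u));
}
```

In this model, rotl32(0x12345678, 8) yields 0x34567812, and a count of 40 behaves like a count of 8.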
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
Example 2
4.246 RPACK2
Shift With Saturation and Pack Two 16 MSBs Into Upper and Lower Register Halves
Compatibility
Opcode
31 30 29 28 27 23 22 18
0 0 0 1 dst src2
5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 1 1 1 1 0 0 s p
5 1 1 1
Description src1 and src2 are shifted left by 1 with saturation. The 16 most-significant bits of the
shifted src1 value are placed in the 16 most-significant bits of dst. The 16
most-significant bits of the shifted src2 value are placed in the 16 least-significant bits
of dst.
If either value saturates, the S1 or S2 bit in SSR and the SAT bit in CSR are written one
cycle after the result is written to dst.
31 16 15 0
a_hi a_lo ←src1
RPACK2
↓ ↓
31 16 15 0
sat(a_hi << 1) sat(b_hi << 1) ←dst
Delay Slots 0
Examples Example 1
A0 FEDCBA98h A2 FDB92468h
A1 12345678h
Example 2
B0 87654321h B2 80002468h
A1 12345678h
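The register values in the two examples above can be reproduced with a short C model. The helper names sshl1 and rpack2 are hypothetical, and the sat flag is a stand-in for the saturation status bits.

```c
#include <stdint.h>

/* Shift left by 1 with signed 32-bit saturation. */
static int32_t sshl1(int32_t v, int *sat)
{
    if (v > 0x3FFFFFFF)  { *sat = 1; return INT32_MAX; }
    if (v < -0x40000000) { *sat = 1; return INT32_MIN; }
    return (int32_t)((uint32_t)v << 1);
}

/* Hypothetical model of RPACK2: pack the 16 MSBs of each shifted source. */
static uint32_t rpack2(int32_t src1, int32_t src2, int *sat)
{
    uint32_t hi = (uint32_t)sshl1(src1, sat) & 0xFFFF0000u;
    uint32_t lo = (uint32_t)sshl1(src2, sat) >> 16;
    return hi | lo;
}
```

With src1 = FEDCBA98h and src2 = 12345678h the model returns FDB92468h; with src1 = 87654321h the first shift saturates and the model returns 80002468h.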
4.247 RSQRDP
Double-Precision Floating-Point Square-Root Reciprocal Approximation
Opcode
31 29 28 27 23 22 18
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Reserved x 1 0 1 1 1 0 1 0 0 0 s p
5 1 1 1
The RSQRDP instruction provides the correct exponent, and the mantissa is accurate
to the eighth binary position (therefore, mantissa error is less than 2^-8). This estimate
can be used as a seed value for an algorithm to compute the reciprocal square root to
greater accuracy.
x[0], the seed value for the algorithm is given by RSQRDP. For each iteration the
accuracy doubles. Thus, with one iteration, the accuracy is 16 bits in the mantissa; with
the second iteration, the accuracy is 32 bits; with the third iteration, the accuracy is the
full 52 bits.
Note—
1) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
2) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3) If src2 is a negative, nonzero, nondenormalized number, NaN_out is placed in
dst and the INVAL bit is set.
4) If src2 is a signed denormalized number, signed infinity is placed in dst and the
DIV0, INEX, and DEN2 bits are set.
5) If src2 is signed 0, signed infinity is placed in dst and the DIV0 and INFO bits
are set. The Newton-Raphson approximation cannot be used to calculate the
square root of 0 because infinity multiplied by 0 is invalid.
else nop
Pipeline
Pipeline Stage E1 E2
Read src2_l, src2_h
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 4010 0000h 0000 0000h A1:A0 4010 0000h 0000 0000h 4.0
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 3FE0 0000h 0000 0000h 0.5
4.248 RSQRSP
Single-Precision Floating-Point Square-Root Reciprocal Approximation
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 1 1 1 1 1 0 1 0 0 0 s p
1 1 1
The RSQRSP instruction provides the correct exponent, and the mantissa is accurate
to the eighth binary position (therefore, mantissa error is less than 2^-8). This estimate
can be used as a seed value for an algorithm to compute the reciprocal square root to
greater accuracy.
x[0], the seed value for the algorithm, is given by RSQRSP. For each iteration, the
accuracy doubles. Thus, with one iteration, accuracy is 16 bits in the mantissa; with the
second iteration, the accuracy is the full 23 bits.
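A sketch of the refinement in C follows. It assumes the usual Newton-Raphson step for a reciprocal square root, x[n+1] = x[n] * (1.5 - (v / 2) * x[n] * x[n]); the function name refine_rsqrt is hypothetical, and the seed argument stands in for the RSQRSP result.

```c
/* Hypothetical refinement of a reciprocal-square-root seed for 1/sqrt(v). */
static float refine_rsqrt(float v, float seed, int iterations)
{
    float x = seed;                            /* x[0]: stand-in for RSQRSP */
    for (int i = 0; i < iterations; ++i) {
        x = x * (1.5f - 0.5f * v * x * x);     /* assumed Newton-Raphson step */
    }
    return x;
}
```

For v = 4.0 and a seed of 0.498 (within 2^-8 of the exact 0.5), two iterations bring the estimate to single-precision accuracy.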
Note—
1) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
2) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3) If src2 is a negative, nonzero, nondenormalized number, NaN_out is placed in
dst and the INVAL bit is set.
4) If src2 is a signed denormalized number, signed infinity is placed in dst and the
DIV0, INEX, and DEN2 bits are set.
5) If src2 is signed 0, signed infinity is placed in dst and the DIV0 and INFO bits
are set. The Newton-Raphson approximation cannot be used to calculate the
square root of 0 because infinity multiplied by 0 is invalid.
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
Example 2
4.249 SADD
Add Two Signed Integers With Saturation
or
Opcode .L unit
31 29 28 27 23 22 18 17
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 0 0 0 1 0 0 0 s p
5 1 1 1
Description src1 is added to src2. If an overflow occurs, the result is saturated according to the
following rules:
1. If the dst is an int and src1 + src2 > 2^31 - 1, then the result is 2^31 - 1.
2. If the dst is an int and src1 + src2 < -2^31, then the result is -2^31.
3. If the dst is a long and src1 + src2 > 2^39 - 1, then the result is 2^39 - 1.
4. If the dst is a long and src1 + src2 < -2^39, then the result is -2^39.
The result is placed in dst. If a saturate occurs, the SAT bit in the control status register
(CSR) is set one cycle after dst is written.
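For the 32-bit (int) case, the two saturation rules can be sketched in C as below. The function name sadd32 is hypothetical, and the sat flag stands in for the SAT bit; the 40-bit (long) case is not modelled.

```c
#include <stdint.h>

/* Hypothetical model of SADD with an int destination. */
static int32_t sadd32(int32_t src1, int32_t src2, int *sat)
{
    int64_t sum = (int64_t)src1 + (int64_t)src2;  /* cannot overflow in 64 bits */
    if (sum > INT32_MAX) { *sat = 1; return INT32_MAX; }
    if (sum < INT32_MIN) { *sat = 1; return INT32_MIN; }
    return (int32_t)sum;
}
```

In this model, 5A2E51A3h + 012A3FA2h gives 5B589145h without saturation, while 436771F2h + 5A2E51A3h saturates to 7FFFFFFFh.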
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
A1 5A2E51A3h
A2 012A3FA2h
A3 5B589145h
Example 2
A1 436771F2h
A2 5A2E51A3h
A3 7FFFFFFFh
Example 3
B2 112A3FA2h
4.250 SADD2
Add Two Signed 16-Bit Integers on Upper and Lower Register Halves With Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 0 0 1 1 0 0 s p
5 1 1 1
Description Performs 2s-complement addition between signed, packed 16-bit quantities in src1 and
src2. The results are placed in a signed, packed 16-bit format into dst.
For each pair of 16-bit quantities in src1 and src2, the sum between the signed 16-bit
value from src1 and the signed 16-bit value from src2 is calculated and saturated to
produce a signed 16-bit result. The result is placed in the corresponding position in dst.
Saturation is performed on each 16-bit result independently. For each sum, the
following tests are applied:
• If the sum is in the range -2^15 to 2^15 - 1, inclusive, then no saturation is performed
and the sum is left unchanged.
• If the sum is greater than 2^15 - 1, then the result is set to 2^15 - 1.
• If the sum is less than -2^15, then the result is set to -2^15.
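The per-halfword tests above can be sketched in C as follows. The names sat16 and sadd2 are hypothetical helpers, and the casts to int16_t assume the usual two's-complement wrapping.

```c
#include <stdint.h>

/* Clamp a wider sum to the signed 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v > 32767)  return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Hypothetical model of SADD2: two independent saturating 16-bit adds. */
static uint32_t sadd2(uint32_t src1, uint32_t src2)
{
    int32_t hi = (int16_t)(src1 >> 16) + (int16_t)(src2 >> 16);
    int32_t lo = (int16_t)(src1 & 0xFFFFu) + (int16_t)(src2 & 0xFFFFu);
    return ((uint32_t)(uint16_t)sat16(hi) << 16) | (uint16_t)sat16(lo);
}
```

For example, sadd2(0x7FFF8000, 0x0001FFFF) saturates both halves and returns 0x7FFF8000 in this model.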
31 16 15 0
a_hi a_lo ←src1
SADD2
↓ ↓
31 16 15 0
sat(a_hi + b_hi) sat(a_lo + b_lo) ←dst
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Examples Example 1
Example 2
4.251 SADDSUB
Parallel SADD and SSUB Operations On Common Inputs
Opcode
31 30 29 28 27 24 23 22 18 17
4 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 1 0 1 1 0 s p
5 1 1 1
If either result saturates, the L1 or L2 bit in SSR and the SAT bit in CSR are written one
cycle after the results are written to dst_o:dst_e.
Delay Slots 0
Examples Example 1
A0 0700C005h A2 0700C006h
A1 FFFFFFFFh A3 0700C004h
Example 2
SADDSUB .L2X B0,A1,B3:B2
B0 7FFFFFFFh B2 7FFFFFFEh
A1 00000001h B3 7FFFFFFFh
Example 3
A0 80000000h A2 80000000h
B1 00000001h A3 80000001h
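The three examples above are consistent with the odd destination register receiving the saturated sum and the even register the saturated difference. The C sketch below models that reading; the struct pair32, the function names, and the sat flag (a stand-in for the saturation status bits) are assumptions for illustration.

```c
#include <stdint.h>

struct pair32 {
    int32_t odd;   /* dst_o: assumed to hold sat(src1 + src2) */
    int32_t even;  /* dst_e: assumed to hold sat(src1 - src2) */
};

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v, int *sat)
{
    if (v > INT32_MAX) { *sat = 1; return INT32_MAX; }
    if (v < INT32_MIN) { *sat = 1; return INT32_MIN; }
    return (int32_t)v;
}

/* Hypothetical model of SADDSUB on common inputs. */
static struct pair32 saddsub(int32_t src1, int32_t src2, int *sat)
{
    struct pair32 r;
    r.odd  = sat32((int64_t)src1 + src2, sat);
    r.even = sat32((int64_t)src1 - src2, sat);
    return r;
}
```

With src1 = 7FFFFFFFh and src2 = 00000001h, the model returns 7FFFFFFFh (saturated sum) in the odd register and 7FFFFFFEh in the even register, matching Example 2.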
4.252 SADDSUB2
Parallel SADD2 and SSUB2 Operations On Common Inputs
Opcode
31 30 29 28 27 24 23 22 18 17
4 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 1 1 1 1 1 0 s p
5 1 1 1
For the SADD2 operation, the upper and lower halves of the src2 operand are added
with saturation to the upper and lower halves of the src1 operand. The values in src1
and src2 are treated as signed, packed 16-bit data and the results are written in signed,
packed 16-bit format into dst_o.
For the SSUB2 operation, the upper and lower halves of the src2 operand are subtracted
with saturation from the upper and lower halves of the src1 operand. The values in src1
and src2 are treated as signed, packed 16-bit data and the results are written in signed,
packed 16-bit format into dst_e.
Delay Slots 0
Examples Example 1
Example 2
SADDSUB2 .L2X B0,A1,B3:B2
4.253 SADDSU2
Add Two Signed and Unsigned 16-Bit Integers on Register Halves With Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 0 1 1 1 0 0 s p
5 1 1 1
For each pair of 16-bit quantities in src1 and src2, the sum between the unsigned 16-bit
value from src1 and the signed 16-bit value from src2 is calculated and saturated to
produce an unsigned 16-bit result. The result is placed in the corresponding position in dst.
Saturation is performed on each 16-bit result independently. For each sum, the
following tests are applied:
• If the sum is in the range 0 to 2^16 - 1, inclusive, then no saturation is performed
and the sum is left unchanged.
• If the sum is greater than 2^16 - 1, then the result is set to 2^16 - 1.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
4.254 SADDUS2
Add Two Unsigned and Signed 16-Bit Integers on Register Halves With Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 0 1 1 1 0 0 s p
5 1 1 1
Description Performs 2s-complement addition between unsigned and signed, packed 16-bit
quantities. The values in src1 are treated as unsigned, packed 16-bit quantities; and the
values in src2 are treated as signed, packed 16-bit quantities. The results are placed in
an unsigned, packed 16-bit format into dst.
For each pair of 16-bit quantities in src1 and src2, the sum between the unsigned 16-bit
value from src1 and the signed 16-bit value from src2 is calculated and saturated to
produce an unsigned 16-bit result. The result is placed in the corresponding position in dst.
Saturation is performed on each 16-bit result independently. For each sum, the
following tests are applied:
• If the sum is in the range 0 to 2^16 - 1, inclusive, then no saturation is performed
and the sum is left unchanged.
• If the sum is greater than 2^16 - 1, then the result is set to 2^16 - 1.
• If the sum is less than 0, then the result is cleared to 0.
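A C sketch of the mixed-sign saturation follows. The helper names satu16 and saddus2 are hypothetical; src1 halves are read as unsigned and src2 halves as signed, as in the Description, and the casts to int16_t assume two's-complement wrapping.

```c
#include <stdint.h>

/* Clamp a wider sum to the unsigned 16-bit range. */
static uint16_t satu16(int32_t v)
{
    if (v < 0)      return 0;
    if (v > 0xFFFF) return 0xFFFF;
    return (uint16_t)v;
}

/* Hypothetical model of SADDUS2: unsigned src1 halves plus signed src2 halves. */
static uint32_t saddus2(uint32_t src1, uint32_t src2)
{
    int32_t hi = (int32_t)(src1 >> 16)     + (int16_t)(src2 >> 16);
    int32_t lo = (int32_t)(src1 & 0xFFFFu) + (int16_t)(src2 & 0xFFFFu);
    return ((uint32_t)satu16(hi) << 16) | satu16(lo);
}
```

For example, saddus2(0xFFFF0001, 0x0001FFFE) clamps the upper half to FFFFh and the lower half (1 + (-2)) to 0000h in this model.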
31 16 15 0
ua_hi ua_lo ←src1
SADDUS2
↓ ↓
31 16 15 0
sat(ua_hi + sb_hi) sat(ua_lo + sb_lo) ←dst
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Examples Example 1
Example 2
4.255 SADDU4
Add With Saturation, Four Unsigned 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 1 1 1 1 0 0 s p
5 1 1 1
Description Performs 2s-complement addition between unsigned, packed 8-bit quantities. The
values in src1 and src2 are treated as unsigned, packed 8-bit quantities and the results
are written into dst in an unsigned, packed 8-bit format.
For each pair of 8-bit quantities in src1 and src2, the sum between the unsigned 8-bit
value from src1 and the unsigned 8-bit value from src2 is calculated and saturated to
produce an unsigned 8-bit result. The result is placed in the corresponding position in
dst.
Saturation is performed on each 8-bit result independently. For each sum, the
following tests are applied:
• If the sum is in the range 0 to 2^8 - 1, inclusive, then no saturation is performed
and the sum is left unchanged.
• If the sum is greater than 2^8 - 1, then the result is set to 2^8 - 1.
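The byte-wise tests above can be sketched in C with a simple loop over the four lanes. The function name saddu4 is a hypothetical helper used only for illustration.

```c
#include <stdint.h>

/* Hypothetical model of SADDU4: four independent saturating 8-bit adds. */
static uint32_t saddu4(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t a = (src1 >> (8 * i)) & 0xFFu;
        uint32_t b = (src2 >> (8 * i)) & 0xFFu;
        uint32_t sum = a + b;
        if (sum > 0xFFu) {
            sum = 0xFFu;              /* clamp to 2^8 - 1 */
        }
        dst |= sum << (8 * i);
    }
    return dst;
}
```

For example, saddu4(0xFF01807F, 0x01028080) returns 0xFF03FFFF in this model: bytes 3 and 1 saturate, while bytes 2 and 0 do not.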
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
SADDU4
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
sat(ua_3 + ub_3) sat(ua_2 + ub_2) sat(ua_1 + ub_1) sat(ua_0 + ub_0) ←dst
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
Example 2
4.256 SAT
Saturate a 40-Bit Integer to a 32-Bit Integer
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 1 0 0 0 0 0 0 1 1 0 s p
1 1 1
Description A 40-bit src2 value is converted to a 32-bit value. If the value in src2 is outside the
range that can be represented in 32 bits, src2 is saturated. The result is placed in dst. If a
saturate occurs, the SAT bit in the control status register (CSR) is set one cycle after dst is
written.
Execution if (cond){
if (src2 > (2^31 - 1)), (2^31 - 1) → dst
else if (src2 < -2^31), -2^31 → dst
else src2[31..0] → dst
}
else nop
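The Execution rules above amount to a clamp into the signed 32-bit range. A minimal C sketch, assuming the 40-bit src2 has already been sign-extended into an int64_t; the function name sat40_to_32 is hypothetical and the SAT bit side effect is not modeled:

```c
#include <stdint.h>

/* Sketch of SAT: clamp a sign-extended 40-bit value to signed 32 bits. */
int32_t sat40_to_32(int64_t src2)
{
    if (src2 > INT32_MAX)      /* src2 > 2^31 - 1 */
        return INT32_MAX;
    if (src2 < INT32_MIN)      /* src2 < -2^31 */
        return INT32_MIN;
    return (int32_t)src2;      /* in range: the low 32 bits are the value */
}
```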
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
B5 xxxxxxxxh B5 7FFFFFFFh
B5 7FFFFFFFh
SSR 00000002h
Example 2
B5 xxxxxxxxh B5 7FFFFFFFh
B5 7FFFFFFFh
SSR 00000002h
Example 3
B5 xxxxxxxxh B5 A1907321h
B5 A1907321h
SSR 00000000h
4.257 SET
Set a Bit Field
or
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 8 7 6 5 4 3 2 1 0
csta cstb 1 0 0 0 1 0 s p
5 5 1 1
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 1 1 1 0 0 0 s p
5 1 1 1
Description For cstb > csta, the field in src2 as specified by csta to cstb is set to all 1s in dst. The csta
and cstb operands may be specified as constants or in the 10 LSBs of the src1 register,
with cstb being bits 0-4 (src1[4..0]) and csta being bits 5-9 (src1[9..5]). csta is the LSB of the
field and cstb is the MSB of the field. In other words, csta and cstb represent the
beginning and ending bits, respectively, of the field to be set to all 1s in dst. The LSB
location of src2 is bit 0 and the MSB location of src2 is bit 31.
In the following example, csta is 15 and cstb is 23. For the register version of the
instruction, only the 10 LSBs of the src1 register are valid. If any of the 22 MSBs are
non-zero, the result is invalid.
cstb
csta
src2 X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst X X X X X X X X 1 1 1 1 1 1 1 1 1 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
For cstb < csta, the src2 register is copied to dst. The csta and cstb operands may be
specified as constants or in the 10 LSBs of the src1 register, with cstb being bits 0−4
(src1[4..0]) and csta being bits 5−9 (src1[9..5]).
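The field selection described above can be sketched in C as a mask built from csta and cstb. The function names set_field and set_field_reg are hypothetical. Treating cstb equal to csta as a one-bit field is an assumption, since the text above only states the cstb > csta and cstb < csta cases. The field bounds 7 and 21 in the usage note are inferred from the before and after register values of Example 1, not stated in it.

```c
#include <stdint.h>

/* Sketch of SET (constant form): set bits csta..cstb of src2 to 1. */
uint32_t set_field(uint32_t src2, unsigned csta, unsigned cstb)
{
    const uint32_t all_ones = UINT32_MAX;
    csta &= 31u;
    cstb &= 31u;
    if (cstb < csta)
        return src2;                      /* src2 is copied unchanged */
    /* Ones from bit 0 up to cstb, ANDed with ones from csta up to bit 31.
       cstb == csta yields a single bit (assumed behavior). */
    uint32_t mask = (uint32_t)(all_ones >> (31u - cstb)) &
                    (uint32_t)(all_ones << csta);
    return src2 | mask;
}

/* Sketch of SET (register form): cstb = src1[4..0], csta = src1[9..5]. */
uint32_t set_field_reg(uint32_t src2, uint32_t src1)
{
    unsigned cstb = src1 & 0x1Fu;
    unsigned csta = (src1 >> 5) & 0x1Fu;
    return set_field(src2, csta, cstb);
}
```

With these definitions, set_field(0x4B134A1E, 7, 21) gives 4B3FFF9Eh and set_field_reg(0x9ED31A31, 0x197) gives 9EFFFA31h, which agree with the register values listed in the examples for this instruction.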
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A0 4B134A1Eh A0 4B134A1Eh
A1 xxxxxxxxh A1 4B3FFF9Eh
Example 2
B0 9ED31A31h B0 9ED31A31h
B1 0000C197h B1 0000C197h
B2 xxxxxxxxh B2 9EFFFA31h
4.258 SHFL
Shuffle
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 0 x 0 0 0 0 1 1 1 1 0 0 s p
1 1 1
Description Performs an interleave operation on the two halfwords in src2. The bits in the lower
halfword of src2 are placed in the even bit positions in dst, and the bits in the upper
halfword of src2 are placed in the odd bit positions in dst.
As a result, bits 0, 1, 2, ..., 14, 15 of src2 are placed in bits 0, 2, 4, ... , 28, 30 of dst. Likewise,
bits 16, 17, 18, ..., 30, 31 of src2 are placed in bits 1, 3, 5, ..., 29, 31 of dst.
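The bit mapping above can be written as a short loop. This is a sketch only; the function name shfl is hypothetical.

```c
#include <stdint.h>

/* Sketch of SHFL: interleave the two halfwords of src2 bit by bit. */
uint32_t shfl(uint32_t src2)
{
    uint32_t dst = 0;
    for (int i = 0; i < 16; i++) {
        dst |= ((src2 >> i) & 1u) << (2 * i);             /* lower half -> even bits */
        dst |= ((src2 >> (16 + i)) & 1u) << (2 * i + 1);  /* upper half -> odd bits */
    }
    return dst;
}
```

A lower halfword of all 1s therefore lands in the even positions (55555555h) and an upper halfword of all 1s in the odd positions (AAAAAAAAh).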
31 16 15 0
abcdefghijklmnop ABCDEFGHIJKLMNOP ←src2
SHFL
31 16 15 0
aAbBcCdDeEfFgGhH iIjJkKlLmMnNoOpP ←dst
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
See Also DEAL
4.259 SHFL3
3-Way Bit Interleave On Three 16-Bit Values Into a 48-Bit Result
Opcode
31 30 29 28 27 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 1 1 0 1 1 0 s p
5 1 1 1
Description Performs a 3-way bit interleave on three 16-bit values and creates a 48-bit result.
31 16 15 0
a15 a14 a13 ... a2 a1 a0 b15 b14 b13 ... b2 b1 b0 ←src1
SHFL3
31 16 15 0
0 0 0 ... 0 0 0 a15 b15 d15 ... b11 d11 a10 ←dst_o
{
result |= (inp0 >> I & 1) << (I × 3);
result |= (inp1 >> I & 1) << ((I × 3) + 1);
result |= (inp2 >> I & 1) << ((I × 3) + 2);
}
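A runnable C version of the loop body above is sketched here. The function name shfl3 is hypothetical. The operand mapping (inp0 is the lower halfword of src2, inp1 the lower halfword of src1, inp2 the upper halfword of src1) and the 16-iteration loop are assumptions, chosen because they reproduce the register values in the example for this instruction.

```c
#include <stdint.h>

/* Sketch of SHFL3: 3-way bit interleave of three 16-bit values into 48 bits. */
uint64_t shfl3(uint32_t src1, uint32_t src2)
{
    uint64_t inp0 = src2 & 0xFFFFu;          /* assumed: lower half of src2 */
    uint64_t inp1 = src1 & 0xFFFFu;          /* assumed: lower half of src1 */
    uint64_t inp2 = (src1 >> 16) & 0xFFFFu;  /* assumed: upper half of src1 */
    uint64_t result = 0;
    for (int i = 0; i < 16; i++) {
        result |= ((inp0 >> i) & 1u) << (i * 3);
        result |= ((inp1 >> i) & 1u) << (i * 3 + 1);
        result |= ((inp2 >> i) & 1u) << (i * 3 + 2);
    }
    return result;  /* upper 32 bits correspond to dst_o, lower 32 to dst_e */
}
```

With src1 = 87654321h and src2 = 12345678h this returns 00008C11h in the upper word and 7E179306h in the lower word.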
Delay Slots 0
A0 87654321h A2 7E179306h
A1 12345678h A3 00008C11h
4.260 SHL
Arithmetic Shift Left
or
Opcode
31 29 28 27 23 22 18 17
13 12 11 6 5 4 3 2 1 0
src1 x op 1 0 0 0 s p
5 1 6 1 1
Description The src2 operand is shifted to the left by the src1 operand. The result is placed in dst.
When a register is used, the six LSBs specify the shift amount and valid values are 0-40.
When an immediate is used, valid shift amounts are 0-31. If src2 is a register pair, only
the bottom 40 bits of the register pair are shifted. The upper 24 bits of the register pair
are unused.
If 39 < src1 < 64, src2 is shifted to the left by 40. Only the six LSBs of src1 are used by the
shifter, so any bits set above bit 5 do not affect execution.
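The shift-amount handling described above (six LSBs used, amounts above 40 treated as 40) can be sketched in C. The function names shl40 and shl32 are hypothetical. Modeling the 32-bit form as the low 32 bits of the 40-bit shift is an assumption that agrees with the 32-bit examples for this instruction.

```c
#include <stdint.h>

/* Sketch of SHL on a 40-bit operand held in the low 40 bits of a uint64_t. */
uint64_t shl40(uint64_t src2, uint32_t src1)
{
    const uint64_t mask40 = (UINT64_C(1) << 40) - 1;
    unsigned shift = src1 & 0x3Fu;     /* only the six LSBs are used */
    if (shift > 40u)
        shift = 40u;                   /* 41..63 behave like a shift by 40 */
    return ((src2 & mask40) << shift) & mask40;
}

/* Sketch of the 32-bit form: keep the low 32 bits of the shifted value. */
uint32_t shl32(uint32_t src2, uint32_t src1)
{
    return (uint32_t)shl40(src2, src1);
}
```

For example, shl32(0x29E3D31C, 4) gives 9E3D31C0h, shl32(0x419751A5, 9) gives 2EA34A00h, and a shift amount of 22h clears a 32-bit result to 0.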
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A0 29E3D31Ch A0 29E3D31Ch
A1 xxxxxxxxh A1 9E3D31C0h
Example 2
B0 419751A5h B0 419751A5h
B1 00000009h B1 00000009h
B2 xxxxxxxxh B2 2EA34A00h
Example 3
B2 00000022h B2 00000000h
Example 4
A5:A4 FFFF FFFFh FFFF FFFFh A5:A4 FFFF FFFFh FFFF FFFFh
4.261 SHL2
2-Way SIMD Shift Left, Packed Signed 16-bit
31 30 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
5 5 5 6
Description The SHL2 instruction performs a shift left on packed 16-bit quantities. The values in
xop1 are viewed as two packed 16-bit quantities. The lower four bits of op2 or ucst4 are
treated as the shift amount. The same shift amount is applied to both 16-bit values. The
results are placed in a signed packed 16-bit format.
For each 16-bit quantity in xop1, the quantity is shifted left by the specified
number of bits. Bits shifted out of the most-significant bit of each 16-bit quantity are
discarded.
For correct operation, bit 4 (the fifth bit) of the constant field must be set to 0.
31 24 23 16 15 8 7 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ←xop1
SHL2
31 24 23 16 15 8 7 0
ijklmnop 00000000 yzABCDEF 00000000 ←dst (for op2 = 8)
Delay Slots 0
Example A0 == 0x1234fedc
SHL2 .S A0,4,A15
A15 == 0x2340edc0
A0 == 0x1234fedc
A1 == 0x00000004
SHL2 .S A0,A1,A5
A5 == 0x2340edc0
B0 == 0xabcd1234
B1 == 0x00000008
SHL2 .S B0,B1,B5
B5 == 0xcd003400
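The examples above can be reproduced with a small C model in which each halfword is shifted independently and bits leaving the halfword are discarded. The function name shl2 is hypothetical.

```c
#include <stdint.h>

/* Sketch of SHL2: shift each packed 16-bit value left by the same amount. */
uint32_t shl2(uint32_t xop1, uint32_t op2)
{
    unsigned shift = op2 & 0xFu;                           /* lower four bits */
    uint32_t hi = ((xop1 >> 16) << shift) & 0xFFFFu;       /* upper halfword */
    uint32_t lo = ((xop1 & 0xFFFFu) << shift) & 0xFFFFu;   /* lower halfword */
    return (hi << 16) | lo;
}
```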
4.262 SHLMB
Shift Left and Merge Byte
Opcode .L unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 0 0 1 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 0 1 1 1 0 0 s p
5 1 1 1
Description Shifts the contents of src2 left by 1 byte, and then the most-significant byte of src1 is
merged into the least-significant byte position. The result is placed in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
ub_3 ub_2 ub_1 ub_0 ←src2
SHLMB
31 24 23 16 15 8 7 0
ub_2 ub_1 ub_0 ua_3 ←dst
Execution if (cond){
ubyte2(src2) → ubyte3(dst);
ubyte1(src2) → ubyte2(dst);
ubyte0(src2) → ubyte1(dst);
ubyte3(src1) → ubyte0(dst)
}
else nop
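The byte moves in the Execution block above reduce to one shift of each operand. A minimal sketch, with the hypothetical function name shlmb:

```c
#include <stdint.h>

/* Sketch of SHLMB: shift src2 left one byte, merge in the top byte of src1. */
uint32_t shlmb(uint32_t src1, uint32_t src2)
{
    return (uint32_t)(src2 << 8) | (src1 >> 24);
}
```

For example, src1 = AABBCCDDh and src2 = 11223344h give 223344AAh.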
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
Example 2
4.263 SHR
Arithmetic Shift Right
or
Opcode
31 29 28 27 23 22 18 17
13 12 11 6 5 4 3 2 1 0
src1 x op 1 0 0 0 s p
5 1 6 1 1
Description The src2 operand is shifted to the right by the src1 operand. The sign-extended result is
placed in dst. When a register is used, the six LSBs specify the shift amount and valid
values are 0-40. When an immediate value is used, valid shift amounts are 0-31. If src2
is a register pair, only the bottom 40 bits of the register pair are shifted. The upper 24
bits of the register pair are unused.
If 39 < src1 < 64, src2 is shifted to the right by 40. Only the six LSBs of src1 are used by
the shifter, so any bits set above bit 5 do not affect execution.
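A C sketch of the 32-bit form follows. The function name shr32 is hypothetical, and the result is returned as a raw 32-bit pattern. Because a 32-bit operand shifted right by 31 or more contains only copies of its sign bit, the model clamps the shift at 31, which gives the same result as the 40-bit limit described above. Right shifts of negative values are implementation-defined in C, so the sign fill is built explicitly.

```c
#include <stdint.h>

/* Sketch of SHR (32-bit form): arithmetic shift right with sign extension. */
uint32_t shr32(uint32_t src2, uint32_t src1)
{
    unsigned shift = src1 & 0x3Fu;     /* only the six LSBs are used */
    if (shift > 31u)
        shift = 31u;                   /* result is all sign bits from here on */
    if (src2 & 0x80000000u)            /* negative: shift in 1s from the left */
        return (uint32_t)~((uint32_t)~src2 >> shift);
    return src2 >> shift;              /* non-negative: shift in 0s */
}
```

For example, shr32(0x14925A41, 0x12) gives 00000524h, and shr32(0xF12363D1, 8) gives FFF12363h.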
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A0 F12363D1h A0 F12363D1h
Example 2
B0 14925A41h B0 14925A41h
B1 00000012h B1 00000012h
B2 xxxxxxxxh B2 00000524h
Example 3
B2 00000019h B2 0000090Ah
Example 4
A5:A4 FFFF FFFFh FFFF FFFFh A5:A4 FFFF FFFFh FFFF FFFFh
4.264 SHR2
Arithmetic Shift Right, Signed, Packed 16-Bit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 1 1 1 1 1 0 0 s p
5 1 1 1
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 0 0 1 0 0 0 s p
5 1 1 1
Description Performs an arithmetic shift right on signed, packed 16-bit quantities. The values in
src2 are treated as signed, packed 16-bit quantities. The lower 5 bits of src1 are treated
as the shift amount. The results are placed in a signed, packed 16-bit format into dst.
For each signed 16-bit quantity in src2, the quantity is shifted right by the number of
bits specified in the lower 5 bits of src1. Bits 5 through 31 of src1 are ignored and may
be non-zero. The shifted quantity is sign-extended, and placed in the corresponding
position in dst. Bits shifted out of the least-significant bit of the signed 16-bit quantity
are discarded.
31 16 15 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ←src2
SHR2
31 16 15 0
aaaaaaaa abcdefgh qqqqqqqq qrstuvwx ←dst
(for src1 = 8)
Note—If the shift amount specified in src1 is in the range 16 to 31, the behavior
is identical to a shift value of 15.
Execution if (cond){
smsb16(src2) >> src1 → smsb16(dst);
slsb16(src2) >> src1 → slsb16(dst)
}
else nop
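The Execution block and the note on shift amounts of 16 to 31 can be sketched in C as follows. The names asr16 and shr2 are hypothetical. The sign fill is built explicitly because right shifts of negative values are implementation-defined in C.

```c
#include <stdint.h>

/* Arithmetic shift right of one 16-bit value held in the low 16 bits. */
static uint32_t asr16(uint32_t half, unsigned shift)
{
    half &= 0xFFFFu;
    if (half & 0x8000u)   /* negative: invert, shift, invert back */
        return ~((~half & 0xFFFFu) >> shift) & 0xFFFFu;
    return half >> shift;
}

/* Sketch of SHR2: shift both signed halfwords right by src1[4..0]. */
uint32_t shr2(uint32_t src2, uint32_t src1)
{
    unsigned shift = src1 & 0x1Fu;   /* bits 5 through 31 are ignored */
    if (shift > 15u)
        shift = 15u;                 /* 16..31 behaves like a shift of 15 */
    return (asr16(src2 >> 16, shift) << 16) | asr16(src2, shift);
}
```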
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Example 2
4.265 SHRMB
Shift Right and Merge Byte
Opcode .L unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 0 1 0 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 1 0 1 1 0 0 s p
5 1 1 1
Description Shifts the contents of src2 right by 1 byte, and then the least-significant byte of src1 is
merged into the most-significant byte position. The result is placed in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
ub_3 ub_2 ub_1 ub_0 ←src2
SHRMB
31 24 23 16 15 8 7 0
ua_0 ub_3 ub_2 ub_1 ←dst
Execution if (cond){
ubyte0(src1) → ubyte3(dst);
ubyte3(src2) → ubyte2(dst);
ubyte2(src2) → ubyte1(dst);
ubyte1(src2) → ubyte0(dst)
}
else nop
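As with SHLMB, the byte moves in the Execution block above reduce to one shift of each operand. A minimal sketch, with the hypothetical function name shrmb:

```c
#include <stdint.h>

/* Sketch of SHRMB: shift src2 right one byte, merge in the low byte of src1. */
uint32_t shrmb(uint32_t src1, uint32_t src2)
{
    return (src2 >> 8) | (uint32_t)(src1 << 24);
}
```

For example, src1 = AABBCCDDh and src2 = 11223344h give DD112233h.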
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
Example 2
4.266 SHRU
Logical Shift Right
or
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 6 5 4 3 2 1 0
src1 x op 1 0 0 0 s p
5 1 6 1 1
Description The src2 operand is shifted to the right by the src1 operand. The zero-extended result
is placed in dst. When a register is used, the six LSBs specify the shift amount and valid
values are 0-40. When an immediate value is used, valid shift amounts are 0-31. If src2
is a register pair, only the bottom 40 bits of the register pair are shifted. The upper 24
bits of the register pair are unused.
If 39 < src1 < 64, src2 is shifted to the right by 40. Only the six LSBs of src1 are used by
the shifter, so any bits set above bit 5 do not affect execution.
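A C sketch of the 32-bit form of the logical shift follows. The function name shru32 is hypothetical. For a 32-bit operand, any shift amount of 32 or more leaves only zeros, so the 40-bit limit described above does not change the result here.

```c
#include <stdint.h>

/* Sketch of SHRU (32-bit form): logical shift right, zero-extended. */
uint32_t shru32(uint32_t src2, uint32_t src1)
{
    unsigned shift = src1 & 0x3Fu;   /* only the six LSBs are used */
    if (shift >= 32u)
        return 0;                    /* every bit has been shifted out */
    return src2 >> shift;
}
```

For example, shru32(0xF12363D1, 8) gives 00F12363h, matching Example 1.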
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
A0 F12363D1h A0 F12363D1h
A1 xxxxxxxxh A1 00F12363h
Example 2
A5:A4 FFFF FFFFh FFFF FFFFh A5:A4 FFFF FFFFh FFFF FFFFh
4.267 SHRU2
Arithmetic Shift Right, Unsigned, Packed 16-Bit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 1 0 0 0 1 1 0 0 s p
5 1 1 1
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 0 1 1 0 0 0 s p
5 1 1 1
Description Performs an arithmetic shift right on unsigned, packed 16-bit quantities. The values in
src2 are treated as unsigned, packed 16-bit quantities. The lower 5 bits of src1 are
treated as the shift amount. The results are placed in an unsigned, packed 16-bit format
into dst.
For each unsigned 16-bit quantity in src2, the quantity is shifted right by the number of
bits specified in the lower 5 bits of src1. Bits 5 through 31 of src1 are ignored and may
be non-zero. The shifted quantity is zero-extended, and placed in the corresponding
position in dst. Bits shifted out of the least-significant bit of the unsigned 16-bit quantity
are discarded.
31 16 15 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ←src2
SHRU2
31 16 15 0
00000000 abcdefgh 00000000 qrstuvwx ←dst
(for src1 = 8)
Note—If the shift amount specified in src1 is in the range of 16 to 31, the dst will
be cleared to all zeros.
Execution if (cond){
umsb16(src2) >> src1 → umsb16(dst);
ulsb16(src2) >> src1 → ulsb16(dst)
}
else nop
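The Execution block and the note on shift amounts of 16 to 31 can be sketched in C as follows. The function name shru2 is hypothetical.

```c
#include <stdint.h>

/* Sketch of SHRU2: shift both unsigned halfwords right by src1[4..0]. */
uint32_t shru2(uint32_t src2, uint32_t src1)
{
    unsigned shift = src1 & 0x1Fu;   /* bits 5 through 31 are ignored */
    if (shift > 15u)
        return 0;                    /* 16..31 clears dst to all zeros */
    uint32_t hi = (src2 >> 16) >> shift;        /* zero-extended */
    uint32_t lo = (src2 & 0xFFFFu) >> shift;    /* zero-extended */
    return (hi << 16) | lo;
}
```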
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Example 2
4.268 SMPY
Multiply Signed 16 LSB × Signed 16 LSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 1 0 0 0 0 0 0 s p
5 1 1 1
Description The 16 least-significant bits of the src1 operand are multiplied by the 16 least-significant bits
of the src2 operand. The result is left shifted by 1 and placed in dst. If the left-shifted
result is 8000 0000h, then the result is saturated to 7FFF FFFFh. If a saturate occurs, the
SAT bit in CSR is set one cycle after dst is written. The source operands are signed by
default.
Execution if (cond){
if (((lsb16(src1) × lsb16(src2)) << 1) != 8000 0000h),
((lsb16(src1) × lsb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
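The multiply, shift, and saturate step above can be sketched in C. All function names here are hypothetical, results are returned as raw 32-bit patterns, and the SAT bit in CSR is not modeled. The same core applies to SMPYH, SMPYHL, and SMPYLH, which differ only in which halfword of each operand is selected.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of a word. */
static int32_t sx16(uint32_t half)
{
    half &= 0xFFFFu;
    return (half & 0x8000u) ? (int32_t)half - 0x10000 : (int32_t)half;
}

/* Multiply two signed 16-bit values, shift left by 1, saturate 80000000h. */
static uint32_t sat_mul_shl1(int32_t a, int32_t b)
{
    uint32_t shifted = (uint32_t)(a * b) << 1;   /* |a * b| <= 2^30 */
    return (shifted == 0x80000000u) ? 0x7FFFFFFFu : shifted;
}

uint32_t smpy(uint32_t src1, uint32_t src2)   { return sat_mul_shl1(sx16(src1), sx16(src2)); }
uint32_t smpyh(uint32_t src1, uint32_t src2)  { return sat_mul_shl1(sx16(src1 >> 16), sx16(src2 >> 16)); }
uint32_t smpyhl(uint32_t src1, uint32_t src2) { return sat_mul_shl1(sx16(src1 >> 16), sx16(src2)); }
uint32_t smpylh(uint32_t src1, uint32_t src2) { return sat_mul_shl1(sx16(src1), sx16(src2 >> 16)); }
```

The only operand pair that reaches 80000000h is 8000h × 8000h, so smpy(0x8000, 0x8000) returns 7FFFFFFFh.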
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.269 SMPYH
Multiply Signed 16 MSB × Signed 16 MSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 1 0 0 0 0 0 0 s p
5 1 1 1
Execution if (cond){
if (((msb16(src1) × msb16(src2)) << 1) != 8000 0000h),
((msb16(src1) × msb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.270 SMPYHL
Multiply Signed 16 MSB × Signed 16 LSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 1 0 0 0 0 0 0 s p
5 1 1 1
Description The 16 most-significant bits of the src1 operand are multiplied by the 16 least-significant
bits of the src2 operand. The result is left shifted by 1 and placed in dst. If the left-shifted
result is 80000000h, then the result is saturated to 7FFFFFFFh. If a saturation occurs,
the SAT bit in CSR is set one cycle after dst is written.
Execution if (cond){
if (((msb16(src1) × lsb16(src2)) << 1) != 8000 0000h),
((msb16(src1) × lsb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.271 SMPYLH
Multiply Signed 16 LSB × Signed 16 MSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 1 0 0 0 0 0 0 s p
5 1 1 1
Description The 16 least-significant bits of the src1 operand are multiplied by the 16 most-significant
bits of the src2 operand. The result is left shifted by 1 and placed in dst. If the left-shifted
result is 80000000h, then the result is saturated to 7FFFFFFFh. If a saturation occurs,
the SAT bit in CSR is set one cycle after dst is written.
Execution if (cond){
if (((lsb16(src1) × msb16(src2)) << 1) != 8000 0000h),
((lsb16(src1) × msb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
4.272 SMPY2
Multiply Signed by Signed, 16 LSB × 16 LSB and 16 MSB × 16 MSB With Left Shift and
Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 0 1 1 1 0 0 s p
5 1 1 1
Description Performs two 16-bit by 16-bit multiplies between two pairs of signed, packed 16-bit
values, with an additional left-shift and saturate. The values in src1 and src2 are treated
as signed, packed 16-bit quantities. The two 32-bit results are written into a 64-bit
register pair.
The SMPY2 instruction produces two 16 × 16 products. Each product is shifted left
by 1. If the left-shifted result is 80000000h, the output value is saturated to 7FFFFFFFh.
The saturated product of the lower halfwords of src1 and src2 is written to the even
destination register, dst_e. The saturated product of the upper halfwords of src1 and
src2 is written to the odd destination register, dst_o.
31 16 15 0
a_hi a_lo ←src1
× ×
b_hi b_lo ←src2
SMPY2
63 32 31 0
sat((a_hi × b_hi) << 1) sat((a_lo × b_lo) << 1) ←dst_o:dst_e
Note—If either product saturates, the SAT bit is set in CSR one cycle after the
cycle that the result is written to dst_o:dst_e. If neither product saturates, the
SAT bit in CSR remains unaffected.
The SMPY2 instruction helps reduce the number of instructions required to perform
two 16-bit by 16-bit saturated multiplies on both the lower and upper halves of two
registers.
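The two saturated products described above can be sketched in C. The function names are hypothetical, the 64-bit return value stands for the register pair dst_o:dst_e, and the SAT bit in CSR is not modeled.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of a word. */
static int32_t sx16(uint32_t half)
{
    half &= 0xFFFFu;
    return (half & 0x8000u) ? (int32_t)half - 0x10000 : (int32_t)half;
}

/* Multiply two signed 16-bit values, shift left by 1, saturate 80000000h. */
static uint32_t sat_mul_shl1(int32_t a, int32_t b)
{
    uint32_t shifted = (uint32_t)(a * b) << 1;   /* |a * b| <= 2^30 */
    return (shifted == 0x80000000u) ? 0x7FFFFFFFu : shifted;
}

/* Sketch of SMPY2: lower halfwords -> dst_e, upper halfwords -> dst_o. */
uint64_t smpy2(uint32_t src1, uint32_t src2)
{
    uint64_t dst_e = sat_mul_shl1(sx16(src1), sx16(src2));
    uint64_t dst_o = sat_mul_shl1(sx16(src1 >> 16), sx16(src2 >> 16));
    return (dst_o << 32) | dst_e;
}
```

For example, src1 = 80000002h and src2 = 80000003h give a saturated upper product (7FFFFFFFh) and a lower product of 0000000Ch.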
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 BED5 6150h 0EEA 8C58h
-1,093,312,176 250,252,376
Example 2
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 04D5 AB98h 2122 FD02h
81,111,960 555,941,122
4.273 SMPY32
Multiply Signed 32-Bit × Signed 32-Bit Into 64-Bit Result With Left Shift and
Saturation
Opcode
31 30 29 28 27 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 0 1 1 1 0 0 s p
5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are signed 32-bit values. The 64-bit
result is shifted left by 1 with saturation, and the 32 most-significant bits of the shifted
value are written to dst.
If the result saturates either on the multiply or the shift, the M1 or M2 bit in SSR and
the SAT bit in CSR are written one cycle after the results are written to dst.
Note—When both inputs are 8000 0000h, the shifted result cannot be
represented as a 32-bit signed value. In this case, the saturation value 7FFF
FFFFh is written into dst.
Execution msb32(sat((src2 × src1) << 1)) → dst
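The Execution line above can be sketched in C using 64-bit arithmetic. The function name smpy32 is hypothetical, the result is returned as a raw 32-bit pattern, and the SSR and CSR side effects are not modeled. The model treats the case where both inputs are 8000 0000h as the only saturating case, following the note above.

```c
#include <stdint.h>

/* Interpret a 32-bit pattern as a signed value, widened to 64 bits. */
static int64_t sx32(uint32_t word)
{
    return (word & 0x80000000u) ? (int64_t)word - INT64_C(4294967296)
                                : (int64_t)word;
}

/* Sketch of SMPY32: msb32(sat((src2 x src1) << 1)). */
uint32_t smpy32(uint32_t src1, uint32_t src2)
{
    if (src1 == 0x80000000u && src2 == 0x80000000u)
        return 0x7FFFFFFFu;                       /* saturation case */
    int64_t product = sx32(src1) * sx32(src2);    /* |product| < 2^62 */
    uint64_t shifted = (uint64_t)product << 1;    /* two's-complement bits */
    return (uint32_t)(shifted >> 32);             /* 32 most-significant bits */
}
```

With src1 = 87654321h and src2 = 12345678h this returns EED8ED1Ah, which agrees with Example 1.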
Delay Slots 3
Examples Example 1
A0 87654321h A2 EED8ED1Ah
A1 12345678h
Example 2
SMPY32 .M1 A0,A1,A2
A0 80000000h A2 7FFFFFFFh
A1 80000000h
4.274 SPACK2
Saturate and Pack Two 16 LSBs Into Upper and Lower Register Halves
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 1 0 1 1 0 0 s p
5 1 1 1
Description Takes two signed 32-bit quantities in src1 and src2 and saturates them to signed 16-bit
quantities. The signed 16-bit results are then packed into a signed, packed 16-bit format
and written to dst. Specifically, the saturated 16-bit signed value of src1 is written to the
upper halfword of dst, and the saturated 16-bit signed value of src2 is written to the
lower halfword of dst.
Saturation is performed on each input value independently. The input values start as
signed 32-bit quantities, and are saturated to 16-bit quantities according to the
following rules:
• If the value is in the range -2^15 to 2^15 - 1, inclusive, then no saturation is
performed and the value is merely truncated to 16 bits.
• If the value is greater than 2^15 - 1, then the result is set to 2^15 - 1.
• If the value is less than -2^15, then the result is set to -2^15.
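The saturation rules above, followed by the packing step, can be sketched in C. The names sat16 and spack2 are hypothetical and operands are passed as raw 32-bit patterns.

```c
#include <stdint.h>

/* Saturate a signed 32-bit pattern to signed 16 bits (returned in bits 15..0). */
static uint32_t sat16(uint32_t word)
{
    int64_t v = (word & 0x80000000u) ? (int64_t)word - INT64_C(4294967296)
                                     : (int64_t)word;
    if (v > 32767)             /* greater than 2^15 - 1 */
        v = 32767;
    if (v < -32768)            /* less than -2^15 */
        v = -32768;
    return (uint32_t)v & 0xFFFFu;
}

/* Sketch of SPACK2: src1 -> upper halfword, src2 -> lower halfword. */
uint32_t spack2(uint32_t src1, uint32_t src2)
{
    return (sat16(src1) << 16) | sat16(src2);
}
```

For example, src1 = 00012345h saturates to 7FFFh while src2 = FFFFFFFBh (-5) is in range and truncates to FFFBh, giving 7FFFFFFBh.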
31 16 15 0
00000000 ABCDEFGH IJKLMNOP QRSTUVWX ←src1
SPACK2
31 16 15 0
01111111 11111111 00YZ1234 56789ABC ←dst
The SPACK2 instruction is useful in code that manipulates 16-bit data at 32-bit
precision for its intermediate steps, but that requires the final results to be in a 16-bit
representation. The saturate step ensures that any values outside the signed 16-bit
range are clamped to the high or low end of the range before being truncated to 16 bits.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Example 2
4.275 SPACKU4
Saturate and Pack Four Signed 16-Bit Integers Into Four Unsigned 8-Bit Halfwords
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 1 0 0 1 1 0 0 s p
5 1 1 1
Description Takes four signed 16-bit values and saturates them to unsigned 8-bit quantities. The
values in src1 and src2 are treated as signed, packed 16-bit quantities. The results are
written into dst in an unsigned, packed 8-bit format.
Each signed 16-bit quantity in src1 and src2 is saturated to an unsigned 8-bit quantity
as described below. The resulting quantities are then packed into an unsigned, packed
8-bit format. Specifically, the upper halfword of src1 is used to produce the
most-significant byte of dst. The lower halfword of src1 is used to produce the second
most-significant byte (bits 16 to 23) of dst. The upper halfword of src2 is used to
produce the third most-significant byte (bits 8 to 15) of dst. The lower halfword of src2
is used to produce the least-significant byte of dst.
31 16 15 0
00000000 ABCDEFGH 00000000 IJKLMNOP ←src1
SPACKU4
31 24 23 16 15 8 7 0
ABCDEFGH FFFFFFFF YZ123456 00000000 ←dst
The SPACKU4 instruction is useful in code that manipulates 8-bit data at 16-bit
precision for its intermediate steps, but that requires the final results to be in an 8-bit
representation. The saturate step ensures that any values outside the unsigned 8-bit
range are clamped to the high or low end of the range before being truncated to 8 bits.
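The clamp-and-pack behavior described above can be sketched in C. The names sat_u8 and spacku4 are hypothetical. The clamp bounds 0 and 255 are taken from the statement that values outside the unsigned 8-bit range are clamped to the low or high end of that range.

```c
#include <stdint.h>

/* Saturate one signed 16-bit value (low 16 bits of half) to unsigned 8 bits. */
static uint32_t sat_u8(uint32_t half)
{
    half &= 0xFFFFu;
    if (half & 0x8000u)        /* negative as a signed 16-bit value */
        return 0;
    if (half > 0xFFu)          /* above the unsigned 8-bit range */
        return 0xFFu;
    return half;
}

/* Sketch of SPACKU4: src1 fills the two upper bytes, src2 the two lower. */
uint32_t spacku4(uint32_t src1, uint32_t src2)
{
    return (sat_u8(src1 >> 16) << 24) | (sat_u8(src1) << 16) |
           (sat_u8(src2 >> 16) << 8)  |  sat_u8(src2);
}
```

For example, src1 = 0100FFFFh (256 and -1) and src2 = 007F0000h (127 and 0) give FF007F00h.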
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Examples Example 1
Example 2
4.276 SPDP
Convert Single-Precision Floating-Point Value to Double-Precision Floating-Point
Value
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 0 1 0 1 0 0 0 s p
1 1 1
Description The single-precision value in src2 is converted to a double-precision value and placed
in dst.
Note—
1) If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2 bits are set.
2) If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3) If src2 is a signed denormalized number, signed 0 is placed in dst and the INEX
and DEN2 bits are set.
4) If src2 is signed infinity, INFO bit is set.
5) No overflow or underflow can occur.
Execution if (cond)dp(src2) → dst
else nop
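The conversion can be sketched in C at the bit level, assuming the host uses IEEE 754 single and double formats for float and double. The function name spdp_bits is hypothetical. Only the denormal rule from the notes above is modeled explicitly. The NaN_out encoding and the status bits (INVAL, NAN2, INEX, DEN2, INFO) are not modeled.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of SPDP: single-precision bit pattern in, double-precision out. */
uint64_t spdp_bits(uint32_t sp_bits)
{
    uint32_t exponent = (sp_bits >> 23) & 0xFFu;
    if (exponent == 0)                      /* zero or denormalized input */
        return (uint64_t)(sp_bits >> 31) << 63;   /* signed 0 */

    float sp;
    double dp;
    uint64_t dp_bits;
    memcpy(&sp, &sp_bits, sizeof sp);       /* reinterpret bits as float */
    dp = (double)sp;                        /* exact widening conversion */
    memcpy(&dp_bits, &dp, sizeof dp_bits);  /* reinterpret double as bits */
    return dp_bits;
}
```

The single-precision pattern 4109999Ah (approximately 8.6) converts to 4021 3333h 4000 0000h.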
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 4021 3333h 4000 0000h 8.6
4.277 SPINT
Convert Single-Precision Floating Point to Integer
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
3 5 5 5 7
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Description The single precision value in src2 is converted to an integer and placed in dst.
Execution if(cond) {
int(src2) -> dst
}
else nop
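The conversion depends on the active rounding mode, which can be sketched in C without relying on the host floating-point environment. The function name spint and the enum round_mode are hypothetical. The four mode values (nearest, toward zero, toward positive infinity, toward negative infinity) are an assumption inferred from the FADCR settings 000h, 200h, 400h, and 600h used in the example for this instruction. Out-of-range inputs, NaN handling, and status bits are not modeled, and the sketch assumes the input magnitude is below 2^24.

```c
#include <stdint.h>

/* Assumed rounding-mode encoding, inferred from the FADCR example values. */
enum round_mode { ROUND_NEAREST = 0, ROUND_ZERO = 1, ROUND_UP = 2, ROUND_DOWN = 3 };

/* Sketch of SPINT: convert a single-precision value to a 32-bit integer. */
int32_t spint(float src2, enum round_mode mode)
{
    int32_t whole = (int32_t)src2;        /* C conversion truncates toward zero */
    float frac = src2 - (float)whole;     /* same sign as src2, magnitude < 1 */

    switch (mode) {
    case ROUND_ZERO:
        return whole;
    case ROUND_UP:
        return frac > 0.0f ? whole + 1 : whole;
    case ROUND_DOWN:
        return frac < 0.0f ? whole - 1 : whole;
    default: {                            /* round to nearest, ties to even */
        float mag = frac < 0.0f ? -frac : frac;
        int away = mag > 0.5f || (mag == 0.5f && (whole & 1));
        if (!away)
            return whole;
        return frac < 0.0f ? whole - 1 : whole + 1;
    }
    }
}
```

For the input 4109999Ah (approximately 8.6), the four modes give 9, 8, 9, and 8 in that order.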
Delay Slots 3
Example FADCR== 0x00000000
A1 == 0x4109999a
SPINT .L A1,A0
A0 == 0x00000009
FADCR= 0x00000080
FADCR== 0x00000200
A1 == 0x4109999a
SPINT .L A1,A0
A0 == 0x00000008
FADCR= 0x00000280
FADCR== 0x00000400
A1 == 0x4109999a
SPINT .L A1,A0
A0 == 0x00000009
FADCR= 0x00000480
FADCR== 0x00000600
A1 == 0x4109999a
SPINT .L A1,A0
A0 == 0x00000008
FADCR= 0x00000680
4.278 SPKERNEL
Software Pipelined Loop (SPLOOP) Buffer Operation Code Boundary
unit = none
Opcode
31 30 29 28 27 22 21 20 19 18 17 16
0 0 0 0 fstg/fcyc 0 0 0 0 1 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPKERNEL instruction is placed in parallel with the last execute packet of the
SPLOOP code body indicating there are no more instructions to load into the loop
buffer. The SPKERNEL instruction also controls at what point in the epilog the
execution of post-SPLOOP instructions begins. This point is specified in terms of stage
and cycle counts, and is derived from the fstg/fcyc field.
The stage and cycle values for both the post-SPLOOP fetch and reload cases are derived
from the fstg/fcyc field. The 6-bit field is interpreted as a function of the ii value from
the associated SPLOOP(D) instruction. The number of bits allocated to stage and cycle
varies according to ii. The value for cycle starts from the least-significant end; the value
for stage starts from the most-significant end, and they grow together. The number of
epilog stages and the number of cycles within those stages are shown in Table 4-11 on
page 4-603. The exact bit allocation to stage and cycle is shown in Table 4-12 on
page 4-603.
4.279 SPKERNELR
Software Pipelined Loop (SPLOOP) Buffer Operation Code Boundary
Syntax SPKERNELR
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPKERNELR instruction is placed in parallel with the last execute packet of the
SPLOOP code body indicating there are no more instructions to load into the loop
buffer. The SPKERNELR instruction also indicates that the execution of both
post-SPLOOP instructions and instructions reloaded from the buffer begin in the first
cycle of the epilog.
4.280 SPLOOP
Software Pipelined Loop (SPLOOP) Buffer Operation
Syntax SPLOOP ii
unit = none
Opcode
31 29 28 27 23 22 21 20 19 18 17 16
creg z ii - 1 0 0 0 0 0 1 1
3 1 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPLOOP instruction invokes the loop buffer mechanism. See Chapter 8 ‘‘Software
Pipelined Loop (SPLOOP) Buffer’’ on page 8-1 for more details.
When the SPLOOP instruction is predicated, it indicates that the loop is a nested loop
using the SPLOOP reload capability. The decision of whether to reload is determined
by the predicate register selected by the creg and z fields.
4.281 SPLOOPD
Software Pipelined Loop (SPLOOP) Buffer Operation With Delayed Testing
Syntax SPLOOPD ii
unit = none
Opcode
31 29 28 27 23 22 21 20 19 18 17 16
creg z ii - 1 0 0 0 0 0 1 1
3 1 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPLOOPD instruction invokes the loop buffer mechanism. The testing of the
termination condition is delayed for four cycles. See Chapter 8 ‘‘Software Pipelined
Loop (SPLOOP) Buffer’’ on page 8-1 for more details.
When the SPLOOPD instruction is predicated, it indicates that the loop is a nested
loop using the SPLOOP reload capability. The decision of whether to reload is
determined by the predicate register selected by the creg and z fields.
4.282 SPLOOPW
Software Pipelined Loop (SPLOOP) Buffer Operation With Delayed Testing and No
Epilog
Syntax SPLOOPW ii
unit = none
Opcode
31 29 28 27 23 22 21 20 19 18 17 16
creg z ii - 1 0 0 0 0 0 1 1
3 1 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPLOOPW instruction invokes the loop buffer mechanism. The testing of the
termination condition is delayed for four cycles. See Chapter 8 ‘‘Software Pipelined
Loop (SPLOOP) Buffer’’ on page 8-1 for more details.
4.283 SPMASK
Software Pipelined Loop (SPLOOP) Buffer Operation Load/Execution Control
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 0 0 0 M2 M1 D2 D1 S2 S1 L2 L1 1 1
1 1 1 1 1 1 1 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPMASK instruction serves two purposes within the SPLOOP mechanism:
1. The SPMASK instruction inhibits the execution of specified instructions from
the buffer within the current execute packet.
2. The SPMASK instruction inhibits the loading of specified instructions into the buffer
during the loading phase, although the instruction will execute normally.
The SPMASK instruction must be the first instruction in the execute packet
containing it.
There are two ways to specify which instructions within the current execute packet will
be masked:
1. The functional units of the instruction can be specified as the SPMASK argument.
2. The instruction to be masked can be marked with a caret (^) in the instruction
code. The following three examples are equivalent:
SPMASK D2,L1
|| MV .D2 B0,B1
|| MV .L1 A0,A1
SPMASK D2
|| MV .D2 B0,B1
||^ MV .L1 A0,A1
SPMASK
||^ MV .D2 B0,B1
||^ MV .L1 A0,A1
The following two examples mask two MV instructions, but do not mask the MPY
instruction.
SPMASK D1, D2
|| MV .D1 A0,A1 ;This unit is SPMASKED
|| MV .D2 B0,B1 ;This unit is SPMASKED
|| MPY.L1 A0,B1 ;This unit is NOT SPMASKED
SPMASK
||^ MV .D1 A0,A1 ;This unit is SPMASKED
||^ MV .D2 B0,B1 ;This unit is SPMASKED
|| MPY.L1 A0,B1 ;This unit is NOT SPMASKED
Execution See Chapter 8 ‘‘Software Pipelined Loop (SPLOOP) Buffer’’ on page 8-1
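As a rough illustration of the opcode map above, the unit-mask field in bits 25 through 18 can be assembled from unit names. This is a minimal sketch, not the assembler's implementation: the names `UNIT_BIT` and `spmask_unit_field` are hypothetical, only the bit positions are taken from the opcode map, and it is an assumption that a caret-marked instruction is encoded by setting the bit of the unit it runs on. The remaining opcode bits are not produced.

```python
# Bit positions of the unit-mask flags, read off the SPMASK opcode map
# (bit 25 = M2 ... bit 18 = L1). The names used here are illustrative only.
UNIT_BIT = {"L1": 18, "L2": 19, "S1": 20, "S2": 21,
            "D1": 22, "D2": 23, "M1": 24, "M2": 25}


def spmask_unit_field(units):
    """Return a 32-bit value with only the mask bits for `units` set."""
    field = 0
    for unit in units:
        # Accept "D2", "d2" or ".D2" spellings.
        field |= 1 << UNIT_BIT[unit.upper().lstrip(".")]
    return field


# SPMASK D2,L1 sets bit 23 and bit 18 of the mask field.
print(hex(spmask_unit_field(["D2", "L1"])))
```

Under the stated assumption, the three equivalent forms of the example (argument list, mixed, and caret only) would all yield the same mask field, because each names the D2 and L1 units.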
4.284 SPMASKR
Software Pipelined Loop (SPLOOP) Buffer Operation Load/Execution Control
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 0 0 0 M2 M1 D2 D1 S2 S1 L2 L1 1 1
1 1 1 1 1 1 1 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPMASKR instruction serves three purposes within the SPLOOP mechanism.
Similar to the SPMASK instruction:
1. The SPMASKR instruction inhibits the execution of specified instructions from
the buffer within the current execute packet.
2. The SPMASKR instruction inhibits the loading of specified instructions into the
buffer during the loading phase, although the instruction will execute normally.
In addition to the functionality of the SPMASK instruction:
3. The SPMASKR instruction controls the reload point for nested loops.
The SPMASKR instruction is placed in the execute packet (in the post-SPKERNEL
code) preceding the execute packet that will overlap with the first cycle of the reload
operation.
The SPKERNELR and the SPMASKR instructions cannot coexist in the same
SPLOOP operation. In the case where reload is intended to start in the first epilog cycle,
the SPKERNELR instruction is used and the SPMASKR instruction is not used for that
nested loop.
The SPMASKR instruction cannot be used in a loop using the SPLOOPW instruction.
The SPMASKR instruction must be the first instruction in the execute packet
containing it.
There are two ways to specify which instructions within the current execute packet will
be masked:
1. The functional units of the instruction can be specified as the SPMASKR
argument.
2. The instruction to be masked can be marked with a caret (^) in the instruction
code. The following three examples are equivalent:
SPMASKR D2,L1
|| MV .D2 B0,B1
|| MV .L1 A0,A1
SPMASKR D2
|| MV .D2 B0,B1
||^ MV .L1 A0,A1
SPMASKR
||^ MV .D2 B0,B1
||^ MV .L1 A0,A1
The following two examples mask two MV instructions, but do not mask the MPY
instruction. The presence of a caret (^) in the instruction code specifies which
instructions are SPMASKed.
SPMASKR D1,D2
|| MV .D1 A0,A1 ;This unit is SPMASKED
|| MV .D2 B0,B1 ;This unit is SPMASKED
|| MPY .L1 A0,B1 ;This unit is NOT SPMASKED
SPMASKR
||^ MV .D1 A0,A1 ;This unit is SPMASKED
||^ MV .D2 B0,B1 ;This unit is SPMASKED
|| MPY .L1 A0,B1 ;This unit is NOT SPMASKED
Execution See Chapter 8 ‘‘Software Pipelined Loop (SPLOOP) Buffer’’ on page 8-1
Example SPMASKR
||^ LDW .D1 *A0,A1 ;This unit is SPMASKed
||^ LDW .D2 *B0,B1 ;This unit is SPMASKed
|| MPY .M1 A3,A4,A5 ;This unit is NOT SPMASKed
4.285 SPTRUNC
Convert Single-Precision Floating-Point Value to Integer With Truncation
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 1 0 1 1 1 1 0 s p
1 1 1
Description The single-precision value in src2 is converted to an integer and placed in dst. This
instruction operates like SPINT except that the rounding modes in the floating-point
adder configuration register (FADCR) are ignored, and round toward zero (truncate)
is always used.
Note—
1) If src2 is NaN, the maximum signed integer (7FFF FFFFh or 8000 0000h) is
placed in dst and the INVAL bit is set.
2) If src2 is signed infinity or if overflow occurs, the maximum signed integer
(7FFF FFFFh or 8000 0000h) is placed in dst and the INEX and OVER bits are
set. Overflow occurs if src2 is greater than 2^31 − 1 or less than −2^31.
3) If src2 is denormalized, 0000 0000h is placed in dst and INEX and DEN2 bits
are set.
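The special cases in the Note can be modeled in a few lines. The following is a minimal sketch operating on a raw single-precision bit pattern; the function name `sptrunc` and the flag names as Python strings are illustrative. It assumes that the choice between 7FFF FFFFh and 8000 0000h follows the sign bit of src2, which the Note does not state explicitly, and it does not model status bits for ordinary in-range conversions.

```python
import math
import struct

INT_MAX = 0x7FFFFFFF
INT_MIN = 0x80000000  # bit pattern of -2**31


def sptrunc(src2_bits):
    """Model SPTRUNC on an IEEE-754 single-precision bit pattern.

    Returns (dst, flags): dst is a 32-bit pattern, flags is the set of
    status bits that the Note lists for the special cases.
    """
    sign = src2_bits >> 31
    exponent = (src2_bits >> 23) & 0xFF
    mantissa = src2_bits & 0x7FFFFF
    saturated = INT_MIN if sign else INT_MAX  # assumption: chosen by sign

    if exponent == 0xFF and mantissa != 0:    # NaN
        return saturated, {"INVAL"}
    if exponent == 0xFF:                      # signed infinity
        return saturated, {"INEX", "OVER"}
    if exponent == 0 and mantissa != 0:       # denormalized
        return 0, {"INEX", "DEN2"}

    value = struct.unpack("<f", struct.pack("<I", src2_bits))[0]
    result = math.trunc(value)                # always round toward zero
    if result > 2**31 - 1 or result < -2**31:  # overflow
        return saturated, {"INEX", "OVER"}
    return result & 0xFFFFFFFF, set()         # other flags are not modeled


dst, flags = sptrunc(0x4109999A)              # 0x4109999A is about 8.6
print(hex(dst), sorted(flags))
```

Because the rounding mode in FADCR is ignored, the same input always truncates to 8 in this model, whereas SPINT can produce 8 or 9 depending on the rounding mode.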
Pipeline
Pipeline Stage   E1     E2   E3   E4
Read             src2
Written                           dst
Unit in use      .L
Delay Slots 3
4.286 SSHL
Shift Left With Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 6 5 4 3 2 1 0
src1 x op 1 0 0 0 s p
5 1 6 1 1
Description The src2 operand is shifted to the left by the src1 operand. The result is placed in dst.
When a register is used to specify the shift, the 5 least-significant bits specify the shift
amount. Valid values are 0 through 31, and the result of the shift is invalid if the shift
amount is greater than 31. The result of the shift is saturated to 32 bits. If a saturate
occurs, the SAT bit in CSR is set one cycle after dst is written.
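The saturating shift described above can be sketched as follows. This is an illustrative model, not a cycle-accurate one: the helper names are hypothetical, the shift amount is taken from the 5 least-significant bits of src1 as described for the register form, and the second return value only indicates whether saturation occurred (the one-cycle delay of the SAT bit is not modeled).

```python
def to_signed32(x):
    """Interpret the low 32 bits of x as a 2s-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sshl(src2, src1):
    """Return (dst, sat) for a saturating left shift of a 32-bit value."""
    shift = src1 & 0x1F                 # only the 5 LSBs give the amount
    wide = to_signed32(src2) << shift   # exact result, Python ints are unbounded
    if wide > 2**31 - 1:
        return 0x7FFFFFFF, True         # saturate to the largest positive value
    if wide < -2**31:
        return 0x80000000, True         # saturate to the most negative value
    return wide & 0xFFFFFFFF, False


print([hex(v) if i == 0 else v for i, v in enumerate(sshl(0x40000000, 1))])
```

Shifting 4000 0000h left by 1 would give 2^31, which does not fit in a signed 32-bit result, so the model returns 7FFF FFFFh and reports saturation.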
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
Example 2
4.287 SSHVL
Variable Shift Left
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 1 0 0 1 1 0 0 s p
5 1 1 1
Description Shifts the signed 32-bit value in src2 to the left or right by the number of bits specified
by src1, and places the result in dst.
Saturation is performed when the value is shifted left under the following conditions:
• If the shifted value is in the range -2^31 to 2^31 - 1, inclusive, then no saturation is
performed, and the result is truncated to 32 bits.
• If the shifted value is greater than 2^31 - 1, then the result is saturated to 2^31 - 1.
• If the shifted value is less than -2^31, then the result is saturated to -2^31.
31 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ←src2
SSHVL
31 0
aaaaaaaa abcdefgh ijklmnop qrstuvwx ←dst
(for src1 = -8)
Note—If the shifted value is saturated, then the SAT bit is set in CSR one cycle
after the result is written to dst. If the shifted value is not saturated, then the
SAT bit is unaffected.
Execution if (cond){
if (0 <= src1 <= 31), sat(src2 << src1) → dst ;
if (-31 <= src1 < 0), (src2 >> abs(src1)) → dst;
if (src1 > 31), sat(src2 << 31) → dst;
if (src1 < -31), (src2 >> 31) → dst
}
else nop
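The four cases of the Execution pseudocode collapse into two once the shift count is clamped to 31. The sketch below is a minimal model under that reading; the function and helper names are hypothetical, both operands are treated as signed 32-bit values, and the returned flag only reports whether saturation occurred.

```python
def _signed32(x):
    """Interpret the low 32 bits of x as a 2s-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def _sat32(value):
    """Saturate an exact integer to 32 bits; return (pattern, saturated)."""
    if value > 2**31 - 1:
        return 0x7FFFFFFF, True
    if value < -2**31:
        return 0x80000000, True
    return value & 0xFFFFFFFF, False


def sshvl(src2, src1):
    """Return (dst, sat) following the SSHVL Execution pseudocode."""
    value = _signed32(src2)
    count = _signed32(src1)
    if count >= 0:
        # Left shift with saturation; counts above 31 behave like 31.
        return _sat32(value << min(count, 31))
    # Negative count: arithmetic right shift, never saturates.
    return (value >> min(-count, 31)) & 0xFFFFFFFF, False


print(hex(sshvl(0x80000000, -8)[0]))  # sign bits fill from the left
```

With src1 = -8 the model reproduces the bit diagram: the top eight bits of dst are copies of the sign bit of src2.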
Pipeline
Pipeline Stage   E1           E2
Read             src1, src2
Written                       dst
Unit in use      .M
Delay Slots 1
Examples Example 1
Example 2
Example 3
4.288 SSHVR
Variable Shift Right
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 1 0 1 1 1 0 0 s p
5 1 1 1
Description Shifts the signed 32-bit value in src2 to the left or right by the number of bits specified
by src1, and places the result in dst.
Saturation is performed when the value is shifted left under the following conditions:
• If the shifted value is in the range -2^31 to 2^31 - 1, inclusive, then no saturation is
performed, and the result is truncated to 32 bits.
• If the shifted value is greater than 2^31 - 1, then the result is saturated to 2^31 - 1.
• If the shifted value is less than -2^31, then the result is saturated to -2^31.
31 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ←src2
SSHVR
31 0
aaaaaaaa bcdefghi jklmnopq rstuvwxy ←dst
(for src1 = 7)
Note—If the shifted value is saturated, then the SAT bit is set in CSR one cycle
after the result is written to dst. If the shifted value is not saturated, then the
SAT bit is unaffected.
Execution if (cond){
if (0 <= src1 <= 31), (src2 >> src1) → dst;
if (-31 <= src1 < 0), sat(src2 << abs(src1)) → dst;
if (src1 > 31), (src2 >> 31) → dst;
if (src1 < -31), sat(src2 << 31) → dst
}
else nop
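The Execution pseudocode for SSHVR can be modeled by clamping the shift count to 31 and choosing the direction from the sign of src1. This is a minimal sketch with hypothetical function names; both operands are treated as signed 32-bit values, and the returned flag only reports whether saturation occurred.

```python
def _signed32(x):
    """Interpret the low 32 bits of x as a 2s-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sshvr(src2, src1):
    """Return (dst, sat) following the SSHVR Execution pseudocode."""
    value = _signed32(src2)
    count = _signed32(src1)
    if count >= 0:
        # Arithmetic right shift; counts above 31 behave like 31.
        return (value >> min(count, 31)) & 0xFFFFFFFF, False
    # Negative count: left shift by abs(src1), clamped to 31, with saturation.
    shifted = value << min(-count, 31)
    if shifted > 2**31 - 1:
        return 0x7FFFFFFF, True
    if shifted < -2**31:
        return 0x80000000, True
    return shifted & 0xFFFFFFFF, False


print(hex(sshvr(0x80000000, 7)[0]))  # seven sign bits shifted in
```

Only the negative-count path can saturate, which matches the statement that saturation is performed when the value is shifted left.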
Pipeline
Pipeline Stage   E1           E2
Read             src1, src2
Written                       dst
Unit in use      .M
Delay Slots 1
Examples Example 1
Example 2
Example 3
4.289 SSUB
Subtract Two Signed Integers With Saturation
Opcode
31 29 28 27 23 22 18 17
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Description src2 is subtracted from src1 and the result is saturated to the result size according to the
following rules:
1. If the result is an int and src1 - src2 > 2^31 - 1, then the result is 2^31 - 1.
2. If the result is an int and src1 - src2 < -2^31, then the result is -2^31.
3. If the result is a long and src1 - src2 > 2^39 - 1, then the result is 2^39 - 1.
4. If the result is a long and src1 - src2 < -2^39, then the result is -2^39.
The result is placed in dst. If a saturate occurs, the SAT bit in CSR is set one cycle after
dst is written.
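The four saturation rules differ only in the result width, so they can be expressed with one parameter. The following is a minimal sketch; the function name `ssub` and the `bits` parameter are hypothetical, and both operands are assumed to be supplied as bit patterns already sign-extended to the result width (32 for an int result, 40 for a long result).

```python
def ssub(src1, src2, bits=32):
    """Return (dst, sat) for src1 - src2 saturated to `bits` bits."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)              # 2**31 or 2**39

    def signed(x):
        x &= mask
        return x - (1 << bits) if x & sign else x

    diff = signed(src1) - signed(src2)  # exact difference
    if diff > sign - 1:
        return sign - 1, True           # rules 1 and 3
    if diff < -sign:
        return sign, True               # rules 2 and 4 (pattern of -2**(bits-1))
    return diff & mask, False


print(hex(ssub(0x5A2E51A3, 0x802A3FA2)[0]))
```

The register values that survive in the Examples are consistent with this model: 5A2E51A3h minus 802A3FA2h overflows and saturates to 7FFFFFFFh, while 436771F2h minus 5A2E51A3h gives E939204Fh without saturation.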
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
B1 5A2E51A3h
B2 802A3FA2h
B3 7FFFFFFFh
Example 2
A0 436771F2h
A1 5A2E 51A3h
A2 E939204Fh
4.290 SSUB2
Subtract Two Signed 16-Bit Integers on Upper and Lower Register Halves With
Saturation
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 1 0 0 1 1 0 s p
5 1 1 1
Description Performs 2s-complement subtraction between signed, packed 16-bit quantities in src1
and src2. The results are placed in a signed, packed 16-bit format into dst.
For each pair of 16-bit quantities in src1 and src2, the difference between the signed
16-bit value from src1 and the signed 16-bit value from src2 is calculated and saturated
to produce a signed 16-bit result. The result is placed in the corresponding position in
dst.
Saturation is performed on each 16-bit result independently. For each difference, the
following tests are applied:
• If the difference is in the range -2^15 to 2^15 - 1, inclusive, then no saturation is
performed and the difference is left unchanged.
• If the difference is greater than 2^15 - 1, then the result is set to 2^15 - 1.
• If the difference is less than -2^15, then the result is set to -2^15.
31            16 15             0
a_hi             a_lo             ←src1
 -                -
SSUB2
b_hi             b_lo             ←src2
 =                =
31            16 15             0
sat(a_hi - b_hi) sat(a_lo - b_lo) ←dst
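The packed operation shown in the diagram can be sketched as two independent 16-bit saturating subtractions. The function name `ssub2` and its helpers are illustrative; the sketch models only the value placed in dst.

```python
def ssub2(src1, src2):
    """Return dst for a packed, saturating 16-bit subtraction."""

    def signed16(x):
        x &= 0xFFFF
        return x - 0x10000 if x & 0x8000 else x

    def sat16(value):
        # Clamp to the range -2**15 .. 2**15 - 1, then repack as 16 bits.
        return max(-2**15, min(2**15 - 1, value)) & 0xFFFF

    hi = sat16(signed16(src1 >> 16) - signed16(src2 >> 16))
    lo = sat16(signed16(src1) - signed16(src2))
    return (hi << 16) | lo


print(hex(ssub2(0x00070005, 0x8000FFFF)))
```

With src1 = 00070005h and src2 = 8000FFFFh, the upper difference 7 - (-32768) exceeds 2^15 - 1 and saturates to 7FFFh, while the lower difference 5 - (-1) = 6 is unchanged, which agrees with the register values listed in the examples.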
Delay Slots 0
A0 00070005h A2 00080006h
A1 FFFFFFFFh
Example 2
A0 00070005h A2 7FFF0006h
A1 8000FFFFh
4.291 STB
Store Byte to Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Syntax
Register Offset Unsigned Constant Offset
STB (.unit) src, *+baseR[offsetR] STB (.unit) src, *+baseR[ucst5]
unit = .D1 or .D2
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 0 y 0 1 1 0 1 s p
5 4 1 1 1
Description Stores a byte to memory from a general-purpose register (src). Table 3-11 on page 3-28
describes the addressing generator options. The memory address is formed from a base
address register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit
unsigned constant (ucst5).
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
For STB, the 8 LSBs of the src register are stored. src can be in either register file,
regardless of the .D unit or baseR or offsetR used. The s bit determines which file src is
read from: s = 0 indicates src will be in the A register file and s = 1 indicates src will be
in the B register file. The r bit should be cleared to 0.
Increments and decrements default to 1 and offsets default to zero when no bracketed
register or constant is specified. Stores that do not modify baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 0.
Parentheses, ( ), can be used to set a nonscaled, constant offset. If you use the optional
offset parameter, you must type either brackets or parentheses around the specified
offset.
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D2
Delay Slots 0
For more information on delay slots for a store, see Chapter 5 ‘‘Pipeline’’ on page 5-1.
Examples Example 1
Example 2
mem 4024:27h xxxx xxxxh mem 4024:27h xxxx xxxxh mem 4024:27h xxxx 67xxh
Example 3
mem 4020:23h xxxx xxxxh mem 4020:23h xxxx xxxxh mem 4020:23h xxxx 67xxh
Example 4
mem 4024:27h xxxx xxxxh mem 4024:27h xxxx xxxxh mem 4024:27h xx67 xxxxh
4.292 STB
Store Byte to Memory With a 15-Bit Unsigned Constant Offset
unit = .D2
Opcode
31 29 28 27 23 22
3 1 5 15
8 7 6 5 4 3 2 1 0
ucst15 y 0 1 1 1 1 s p
15 1 1 1
Description Stores a byte to memory from a general-purpose register (src). The memory address is
formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a
15-bit unsigned constant (ucst15). The assembler selects this format only when the
constant is larger than five bits in magnitude. This instruction executes only on the .D2
unit.
The offset, ucst15, is scaled by a left-shift of 0 bits. After scaling, ucst15 is added to
baseR. The result of the calculation is the address that is sent to memory. The
addressing arithmetic is always performed in linear mode.
For STB, the 8 LSBs of the src register are stored. src can be in either register file. The s
bit determines which file src is read from: s = 0 indicates src is in the A register file and
s = 1 indicates src is in the B register file.
Pipeline
Pipeline Stage E1
Read B14/B15, src
Written
Unit in use .D2
Delay Slots 0
4.293 STDW
Store Doubleword to Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Syntax
Register Offset Unsigned Constant Offset
STDW (.unit) src, *+baseR[offsetR] STDW (.unit) src, *+baseR[ucst5]
unit = .D1 or .D2
Opcode
31 29 28 27 23 22 18 17
13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 1 y 1 0 0 0 1 s p
5 4 1 1 1
Description Stores a 64-bit quantity to memory from a 64-bit register, src. Table 3-11 on page 3-28
describes the addressing generator options. Alignment to a 64-bit boundary is
required. The memory address is formed from a base address register (baseR) and an
optional offset that is either a register (offsetR) or a 5-bit unsigned constant (ucst5). If
an offset is not given, the assembler assigns an offset of zero.
Both offsetR and baseR must be in the same register file, and on the same side, as the .D
unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file. The r bit has a value of 1 for the
STDW instruction.
The offsetR/ucst5 is scaled by a left shift of 3 bits. After scaling, offsetR/ucst5 is added to,
or subtracted from, baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed from memory.
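The relationship between the address generator option, the address that is accessed, and the value left in baseR can be sketched as follows. This is a minimal model in linear mode only; the mode names and the function name `stdw_address` are hypothetical labels, and it is an assumption that the positive-offset and negative-offset options leave baseR unchanged.

```python
# mode: (sign of the offset, access uses the computed address, baseR is updated)
MODES = {
    "positive_offset": (+1, True, False),
    "negative_offset": (-1, True, False),
    "preincrement":    (+1, True, True),
    "predecrement":    (-1, True, True),
    "postincrement":   (+1, False, True),
    "postdecrement":   (-1, False, True),
}


def stdw_address(base, offset, mode):
    """Return (access_address, new_baseR) for a doubleword store."""
    sign, use_computed, write_back = MODES[mode]
    # The offset is scaled by a left shift of 3 bits (8 bytes per doubleword).
    computed = (base + sign * (offset << 3)) & 0xFFFFFFFF
    address = computed if use_computed else base
    new_base = computed if write_back else base
    return address, new_base


print([hex(v) for v in stdw_address(0x1000, 1, "postincrement")])
```

For a postincrement store from 1000h with an offset of 1, the model accesses 1000h and leaves 1008h in baseR; the preincrement form accesses 1008h instead.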
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
The src pair can be in either register file, regardless of the .D unit or baseR or offsetR
used. The s bit determines which file src will be loaded from: s = 0 indicates src will be
in the A register file and s = 1 indicates src will be in the B register file.
Assembler Notes When no bracketed register or constant is specified, the assembler defaults increments
and decrements to 1 and offsets to 0. Stores that do not modify baseR can
use the assembler syntax *R. Square brackets, [ ], indicate that the ucst5 offset is
left-shifted by 3 for doubleword stores.
Parentheses, ( ), can be used to tell the assembler that the offset is a non-scaled, constant
offset. The assembler right shifts the constant by 3 bits for doubleword stores before
using it for the ucst5 field. After scaling by the STDW instruction, this results in the
same constant offset as the assembler source if the least-significant three bits are zeros.
For example, STDW (.unit) src, *+baseR (16) represents an offset of 16 bytes (2
doublewords), and the assembler writes out the instruction with ucst5 = 2. STDW
(.unit) src, *+baseR [16] represents an offset of 16 doublewords, or 128 bytes, and the
assembler writes out the instruction with ucst5 = 16.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used. The register pair syntax always places the odd-numbered
register first, followed by a colon and the even-numbered register (that is, A1:A0, B1:B0,
A3:A2, B3:B2, etc.).
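The difference between the bracket and parenthesis forms can be illustrated by computing the ucst5 field for each. The sketch below uses hypothetical function names; rejecting a byte offset whose three least-significant bits are not zero is a choice made for this sketch, since the text only states that such an offset does not round-trip.

```python
def encode_stdw_offset(offset, scaled):
    """Return the ucst5 field for *+baseR[offset] or *+baseR(offset).

    scaled=True models square brackets (offset in doublewords);
    scaled=False models parentheses (offset in bytes).
    """
    if not scaled and offset & 0x7:
        raise ValueError("byte offset is not a multiple of 8")
    ucst5 = offset if scaled else offset >> 3   # assembler right shift by 3
    if not 0 <= ucst5 <= 31:
        raise ValueError("offset does not fit in the 5-bit field")
    return ucst5


def stdw_byte_offset(ucst5):
    """Byte offset applied by the instruction: ucst5 scaled by 8."""
    return ucst5 << 3


print(encode_stdw_offset(16, scaled=False), encode_stdw_offset(16, scaled=True))
```

This reproduces the two cases in the text: `(16)` is encoded as ucst5 = 2 and addresses 16 bytes past baseR, while `[16]` is encoded as ucst5 = 16 and addresses 128 bytes past baseR.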
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D
Delay Slots 0
Examples Example 1
A3:A2 A176 3B28h 6041 AD65h A3:A2 A176 3B28h 6041 AD65h
Byte Memory Address 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 A1 76 3B 28 60 41 AD 65
Example 2
A3:A2 A176 3B28h 6041 AD65h A3:A2 A176 3B28h 6041 AD65h
Byte Memory Address 100D 100C 100B 100A 1009 1008 1007 1006 1005 1004 1003
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 A1 76 3B 28 60 41 AD 65 00
Example 3
A9:A8 ABCD EF98h 0123 4567h A9:A8 ABCD EF98h 0123 4567h
Byte Memory Address 4051 4050 404F 404E 404D 404C 404B 404A 4049 4048 4047
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 AB CD EF 98 01 23 45 67 00
Example 4
A9:A8 ABCD EF98h 0123 4567h A9:A8 ABCD EF98h 0123 4567h
Byte Memory Address 4039 4038 4037 4036 4035 4034 4033 4032 4031 4030 402F
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 AB CD EF 98 01 23 45 67 00
Example 5
A9:A8 ABCD EF98h 0123 4567h A9:A8 ABCD EF98h 0123 4567h
Byte Memory Address 4059 4058 4057 4056 4055 4054 4053 4052 4051 4050 404F
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 AB CD EF 98 01 23 45 67 00
4.294 STH
Store Halfword to Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Syntax
Register Offset Unsigned Constant Offset
STH (.unit) src, *+baseR[offsetR] STH (.unit) src, *+baseR[ucst5]
unit = .D1 or .D2
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 0 y 1 0 1 0 1 s p
5 4 1 1 1
Description Stores a halfword to memory from a general-purpose register (src). Table 3-10 on
page 3-28 describes the addressing generator options. The memory address is formed
from a base address register (baseR) and an optional offset that is either a register
(offsetR) or a 5-bit unsigned constant (ucst5).
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
For STH, the 16 LSBs of the src register are stored. src can be in either register file,
regardless of the .D unit or baseR or offsetR used. The s bit determines which file src is
read from: s = 0 indicates src will be in the A register file and s = 1 indicates src will be
in the B register file. The r bit should be cleared to 0.
Increments and decrements default to 1 and offsets default to zero when no bracketed
register or constant is specified. Stores that do not modify baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 1.
Parentheses, ( ), can be used to set a nonscaled, constant offset. If you use the optional
offset parameter, you must type either brackets or parentheses around the specified
offset.
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D2
Delay Slots 0
For more information on delay slots for a store, see Chapter 5 ‘‘Pipeline’’ on page 5-1.
Examples Example 1
Example 2
4.295 STH
Store Halfword to Memory With a 15-Bit Unsigned Constant Offset
unit = .D2
Opcode
31 29 28 27 23 22
3 1 5 15
8 7 6 5 4 3 2 1 0
ucst15 y 1 0 1 1 1 s p
15 1 1 1
Description Stores a halfword to memory from a general-purpose register (src). The memory
address is formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset,
which is a 15-bit unsigned constant (ucst15). The assembler selects this format only
when the constant is larger than five bits in magnitude. This instruction executes only
on the .D2 unit.
The offset, ucst15, is scaled by a left-shift of 1 bit. After scaling, ucst15 is added to baseR.
The result of the calculation is the address that is sent to memory. The addressing
arithmetic is always performed in linear mode.
For STH, the 16 LSBs of the src register are stored. src can be in either register file. The
s bit determines which file src is read from: s = 0 indicates src is in the A register file and
s = 1 indicates src is in the B register file.
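The address calculation for this form can be sketched as follows. The function name `sth_ucst15` and the dictionary used as memory are illustrative; `base` stands for the current value of B14 or B15, and little-endian byte order is an assumption of the sketch.

```python
def sth_ucst15(mem, base, ucst15, src):
    """Store the 16 LSBs of src at base + (ucst15 << 1); return the address."""
    if not 0 <= ucst15 < (1 << 15):
        raise ValueError("ucst15 out of range")
    addr = (base + (ucst15 << 1)) & 0xFFFFFFFF   # offset scaled by a left shift of 1
    half = src & 0xFFFF                          # only the 16 LSBs are stored
    mem[addr] = half & 0xFF                      # assumption: little-endian order
    mem[(addr + 1) & 0xFFFFFFFF] = half >> 8
    return addr


memory = {}
print(hex(sth_ucst15(memory, 0x4000, 0x10, 0x12345678)))
```

Because the constant is scaled by 2, the largest ucst15 value reaches 65534 bytes past the base register, and baseR itself is not modified.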
Pipeline
Pipeline Stage E1
Read B14/B15, src
Written
Unit in use .D2
Delay Slots 0
4.296 STNDW
Store Nonaligned Doubleword to Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Syntax
Register Offset Unsigned Constant Offset
STNDW (.unit) src, *+baseR[offsetR] STNDW (.unit) src, *+baseR[ucst5]
unit = .D1 or .D2
Opcode
31 29 28 27 24 23 22 18 17
3 1 4 1 5 5
13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 1 y 1 1 1 0 1 s p
5 4 1 1 1
Description Stores a 64-bit quantity to memory from a 64-bit register pair, src. Table 3-11 on
page 3-28 describes the addressing generator options. The STNDW instruction may
write a 64-bit value to any byte boundary. Thus alignment to a 64-bit boundary is not
required. The memory address is formed from a base address register (baseR) and an
optional offset that is either a register (offsetR) or a 5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file and on the same side as the .D
unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The STNDW instruction supports both scaled and nonscaled offsets. The sc
field indicates whether the offsetR/ucst5 is scaled. If sc is 1 (scaled), the
offsetR/ucst5 is shifted left 3 bits before being added to or subtracted from baseR. If sc is 0
(nonscaled), the offsetR/ucst5 is not shifted before being added to or subtracted from
baseR. For the preincrement, predecrement, positive offset, and negative offset address
generator options, the result of the calculation is the address to be accessed in memory.
For postincrement or post-decrement addressing, the value of baseR before the
addition or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
The src pair can be in either register file, regardless of the .D unit or baseR or offsetR
used. The s bit determines which file src is read from: s = 0 indicates src is in the A
register file and s = 1 indicates src is in the B register file. The r bit has a
value of 1 for the STNDW instruction.
Parentheses, ( ), can be used to indicate to the assembler that the offset is a nonscaled
offset.
For example, STNDW (.unit) src, *+baseR (12) represents an offset of 12 bytes and the
assembler writes out the instruction with offsetC = 12 and sc = 0.
STNDW (.unit) src, *+baseR [16] represents an offset of 16 doublewords, or 128 bytes,
and the assembler writes out the instruction with offsetC = 16 and sc = 1.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
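The scaling rule can be sketched in Python. This is an illustrative model only, not TI code; the function name stndw_offset_bytes and its parameters are hypothetical and stand in for the offsetR/ucst5 field and the sc bit described above.

```python
def stndw_offset_bytes(offset: int, sc: int) -> int:
    """Byte offset applied to baseR for STNDW (illustrative model).

    offset: value of the offsetR/ucst5 field
    sc:     1 = scaled (bracket syntax), 0 = nonscaled (parenthesis syntax)
    """
    if sc not in (0, 1):
        raise ValueError("sc must be 0 or 1")
    # A scaled offset counts doublewords, so it is shifted left by 3 bits.
    return offset << 3 if sc else offset


# *+baseR(12): nonscaled, an offset of 12 bytes
nonscaled = stndw_offset_bytes(12, sc=0)
# *+baseR[16]: scaled, an offset of 16 doublewords
scaled = stndw_offset_bytes(16, sc=1)
```

Under this model the two assembler forms from the text yield offsets of 12 bytes and 128 bytes, respectively.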
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D
Delay Slots 0
Examples Example 1
Byte Memory Address 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 A1 76 3B 28 60 41 AD 65 00
Example 2
A3:A2 A176 3B28h 6041 AD65h A3:A2 A176 3B28h 6041 AD65h
Byte Memory Address 100B 100A 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 A1 76 3B 28 60 41 AD 65 00 00 00
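The byte tables in Examples 1 and 2 are consistent with a little-endian store of the A3:A2 value beginning at addresses 1001h and 1003h, respectively. The sketch below reproduces that layout in Python. The start addresses are inferred from the tables (the instruction text is not shown here), little-endian mode is an assumption, and the helper name store_nonaligned_dw is hypothetical.

```python
def store_nonaligned_dw(mem: bytearray, addr: int, hi: int, lo: int) -> None:
    """Write the 64-bit value hi:lo at any byte address, little-endian."""
    value = ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)
    mem[addr:addr + 8] = value.to_bytes(8, "little")


# Model addresses 1000h-100Bh as a 12-byte buffer; index 0 is address 1000h.
mem1 = bytearray(12)
store_nonaligned_dw(mem1, 0x1001 - 0x1000, 0xA1763B28, 0x6041AD65)

mem2 = bytearray(12)
store_nonaligned_dw(mem2, 0x1003 - 0x1000, 0xA1763B28, 0x6041AD65)
```

Neither start address is a multiple of 8, which is the case the nonaligned store is meant to handle.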
4.297 STNW
Store Nonaligned Word to Memory With a 5-Bit Unsigned Constant Offset or Register
Offset
Syntax
Register Offset Unsigned Constant Offset
STNW (.unit) src, *+baseR[offsetR] STNW (.unit) src, *+baseR[ucst5]
unit = .D1 or .D2
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 1 y 1 0 1 0 1 s p
5 4 1 1 1
Description Stores a 32-bit quantity to memory from a 32-bit register, src. Table 3-11 on page 3-28
describes the addressing generator options. The STNW instruction may write a 32-bit
value to any byte boundary. Thus alignment to a 32-bit boundary is not required. The
memory address is formed from a base address register (baseR) and an optional offset
that is either a register (offsetR) or a 5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file, and on the same side, as the .D
unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The offsetR/ucst5 is scaled by a left shift of 2 bits. After scaling, offsetR/ucst5 is added to,
or subtracted from, baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
The src can be in either register file, regardless of the .D unit or baseR or offsetR used.
The s bit determines which file src is read from: s = 0 indicates src is in the
A register file and s = 1 indicates src is in the B register file. The r bit has a value of
1 for the STNW instruction.
Parentheses, ( ), can be used to tell the assembler that the offset is a non-scaled, constant
offset. The assembler right shifts the constant by 2 bits for word stores before using it
for the ucst5 field. After scaling by the STNW instruction, this results in the same
constant offset as the assembler source if the least-significant two bits are zeros.
For example, STNW (.unit) src,*+baseR (12) represents an offset of 12 bytes (3 words),
and the assembler writes out the instruction with ucst5 = 3.
STNW (.unit) src,*+baseR [12] represents an offset of 12 words, or 48 bytes, and the
assembler writes out the instruction with ucst5 = 12.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
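The relationship between the assembler syntax and the ucst5 field can be sketched as follows. This is a minimal model of the behavior described above, not the assembler's implementation; the function names stnw_ucst5 and stnw_effective_offset are hypothetical.

```python
def stnw_ucst5(constant: int, scaled: bool) -> int:
    """Model how the assembler fills ucst5 for STNW (illustrative).

    scaled=True  models bracket syntax     *+baseR[constant]  (word offset)
    scaled=False models parenthesis syntax *+baseR(constant)  (byte offset)
    """
    # For the nonscaled form the constant is right-shifted by 2 bits.
    field = constant if scaled else constant >> 2
    if not 0 <= field < 32:
        raise ValueError("offset does not fit in a 5-bit unsigned field")
    return field


def stnw_effective_offset(ucst5: int) -> int:
    """Byte offset the instruction applies: ucst5 shifted left by 2 bits."""
    return ucst5 << 2


paren_field = stnw_ucst5(12, scaled=False)     # *+baseR(12)
bracket_field = stnw_ucst5(12, scaled=True)    # *+baseR[12]
```

A parenthesized constant whose two least-significant bits are not zero, such as 13, loses those bits in this model: the field becomes 3 and the effective offset is 12 bytes.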
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D
Delay Slots 0
Examples Example 1
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00
Data Value After Store 00 00 00 A1 76 3B 28 00
Example 2
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00
Data Value After Store 00 A1 76 3B 28 00 00 00
4.298 STW
Store Word to Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Syntax
Register Offset Unsigned Constant Offset
STW (.unit) src, *+baseR[offsetR] STW (.unit) src, *+baseR[ucst5]
unit = .D1 or .D2
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 9 8 7 6 5 4 3 2 1 0
offsetR/ucst5 mode 0 y 1 1 1 0 1 s p
5 4 1 1 1
Description Stores a word to memory from a general-purpose register (src). Table 3-11 on
page 3-28 describes the addressing generator options. The memory address is formed
from a base address register (baseR) and an optional offset that is either a register
(offsetR) or a 5-bit unsigned constant (ucst5).
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
The addressing arithmetic that performs the additions and subtractions defaults to
linear mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular
mode by writing the appropriate value to the AMR (see ‘‘Addressing Mode Register
(AMR)’’ on page 2-12).
For STW, all 32 bits of the src register are stored. src can be in either register file,
regardless of the .D unit or baseR or offsetR used. The s bit determines which file src is
read from: s = 0 indicates src is in the A register file and s = 1 indicates src is
in the B register file. The r bit should be cleared to 0.
Increments and decrements default to 1 and offsets default to zero when no bracketed
register or constant is specified. Stores that do not modify baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 2.
Parentheses, ( ), can be used to set a nonscaled, constant offset. For example,
STW (.unit) src, *+baseR(12) represents an offset of 12 bytes, whereas
STW (.unit) src, *+baseR[12] represents an offset of 12 words, or 48 bytes. You must
type either brackets or parentheses around the specified offset if you use the optional
offset parameter.
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D
Delay Slots 0
For more information on delay slots for a store, see Chapter 5 ‘‘Pipeline’’ on page 5-1.
Examples Example 1
Example 2
mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh
mem 4034h xxxx xxxxh mem 4034h xxxx xxxxh mem 4034h 0123 4567h
Example 3
mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh
mem 4028h xxxx xxxxh mem 4028h xxxx xxxxh mem 4028h 0123 4567h
Example 4
mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh
mem 4038h xxxx xxxxh mem 4038h xxxx xxxxh mem 4038h 0123 4567h
4.299 STW
Store Word to Memory With a 15-Bit Unsigned Constant Offset
unit = .D2
Opcode
31 29 28 27 23 22
3 1 5 15
8 7 6 5 4 3 2 1 0
ucst15 y 1 1 1 1 1 s p
15 1 1 1
Description Stores a word to memory from a general-purpose register (src). The memory address is
formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a
15-bit unsigned constant (ucst15). The assembler selects this format only when the
constant is larger than five bits in magnitude. This instruction executes only on the .D2
unit.
The offset, ucst15, is scaled by a left-shift of 2 bits. After scaling, ucst15 is added to
baseR. The result of the calculation is the address that is sent to memory. The
addressing arithmetic is always performed in linear mode.
For STW, all 32 bits of the src register are stored. src can be in either register file.
The s bit determines which file src is read from: s = 0 indicates src is in the A register file
and s = 1 indicates src is in the B register file.
Pipeline
Pipeline Stage E1
Read B14/B15, src
Written
Unit in use .D2
Delay Slots 0
4.300 SUB
Subtract Two Signed Integers Without Saturation
or
or
SUB (.D1 or .D2) src2, src1, dst (if the cross path form is not used)
or
SUB (.D1 or .D2) src1, src2, dst (if the cross path form is used)
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 6 5 4 3 2 1 0
src1 x op 1 0 0 0 s p
5 1 6 1 1
src2 - src1:
31 30 29 28 27 23 22 18 17
1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 1 0 1 1 1 0 0 s p
5 1 1 1
Description for .L1, .L2 and .S1, .S2 Opcodes: src2 is subtracted from src1. The result is placed in dst.
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 1 0 0 1 1 0 0 s p
5 1 1 1
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
See Also ADD, SUBC, SUBU, SSUB, SUB2
4.301 SUBAB
Subtract Using Byte Addressing Mode
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
Description src1 is subtracted from src2 using the byte addressing mode specified for src2. The
subtraction defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the
mode can be changed to circular mode by writing the appropriate value to the AMR
(see ‘‘Addressing Mode Register (AMR)’’ on page 2-12). The result is placed in dst.
Execution if (cond)src2 -a src1 → dst
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
A0 00000004h A0 00000004h
A5 00004000h A5 0000400Ch
1. BK0 = 3 → size = 16
A5 in circular addressing mode using BK0
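The difference between linear and circular subtraction can be sketched in Python. This is an illustrative model under stated assumptions, not a description of the hardware: the block size is taken as 2^(BK + 1) bytes (consistent with the footnote, where BK0 = 3 gives a size of 16), only the address bits inside the block are assumed to change, and the function name sub_addr and its parameters are hypothetical. The shift parameter models the left shift of src1 that this guide describes for SUBAH (1) and SUBAW (2).

```python
from typing import Optional


def sub_addr(src2: int, src1: int, shift: int = 0, bk: Optional[int] = None) -> int:
    """Model of src2 -a src1 (illustrative).

    shift: 0 for SUBAB, 1 for SUBAH, 2 for SUBAW
    bk:    None for linear mode, else the AMR block-size field value
    """
    offset = src1 << shift
    if bk is None:
        return (src2 - offset) & 0xFFFFFFFF
    size = 1 << (bk + 1)            # assumed: block size = 2**(bk + 1) bytes
    mask = size - 1
    # Only the low bits inside the block change; the upper bits are preserved.
    return (src2 & ~mask & 0xFFFFFFFF) | ((src2 - offset) & mask)


# SUBAB register values shown above: A5 = 00004000h, A0 = 4, BK0 = 3
circular_byte = sub_addr(0x00004000, 4, shift=0, bk=3)
linear_byte = sub_addr(0x00004000, 4, shift=0)
```

With these inputs the circular result wraps inside the 16-byte block to 0000400Ch, matching the A5 value in the table, whereas the linear result would be 00003FFCh.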
4.302 SUBABS4
Subtract With Absolute Value, Four 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 1 1 0 1 0 1 1 0 s p
5 1 1 1
Description Calculates the absolute value of the differences between the packed 8-bit data contained
in the source registers. The values in src1 and src2 are treated as unsigned, packed 8-bit
quantities. The result is written into dst in an unsigned, packed 8-bit format.
For each pair of unsigned 8-bit values in src1 and src2, the absolute value of the
difference is calculated. This result is then placed in the corresponding position in dst.
• The absolute value of the difference between src1 byte0 and src2 byte0 is placed in
byte0 of dst.
• The absolute value of the difference between src1 byte1 and src2 byte1 is placed in
byte1 of dst.
• The absolute value of the difference between src1 byte2 and src2 byte2 is placed in
byte2 of dst.
• The absolute value of the difference between src1 byte3 and src2 byte3 is placed in
byte3 of dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ←src1
- - - -
SUBABS4
ub_3 ub_2 ub_1 ub_0 ←src2
= = = =
31 24 23 16 15 8 7 0
abs(ua_3 - ub_3) abs(ua_2 - ub_2) abs(ua_1 - ub_1) abs(ua_0 - ub_0) ←dst
Execution if (cond){
abs(ubyte0(src1) - ubyte0(src2)) → ubyte0(dst);
abs(ubyte1(src1) - ubyte1(src2)) → ubyte1(dst);
abs(ubyte2(src1) - ubyte2(src2)) → ubyte2(dst);
abs(ubyte3(src1) - ubyte3(src2)) → ubyte3(dst)
}
else nop
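The execution rule above can be modeled lane by lane in Python. This is a minimal sketch for checking results by hand, not TI code; the function name subabs4 and the input values are arbitrary choices for illustration.

```python
def subabs4(src1: int, src2: int) -> int:
    """Per-byte absolute difference of two packed unsigned 8-bit words."""
    dst = 0
    for lane in range(4):
        shift = 8 * lane
        a = (src1 >> shift) & 0xFF
        b = (src2 >> shift) & 0xFF
        # Each lane is independent; no borrow crosses a byte boundary.
        dst |= abs(a - b) << shift
    return dst


result = subabs4(0x10FF0380, 0x20010580)
```

For these inputs the four lanes give 10h, FEh, 02h, and 00h, so the packed result is 10FE0200h.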
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
4.303 SUBAH
Subtract Using Halfword Addressing Mode
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
Description src1 is subtracted from src2 using the halfword addressing mode specified for src2. The
subtraction defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the
mode can be changed to circular mode by writing the appropriate value to the AMR
(see ‘‘Addressing Mode Register (AMR)’’ on page 2-12). src1 is left shifted by 1. The
result is placed in dst.
Execution if (cond)src2 -a src1 → dst
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
4.304 SUBAW
Subtract Using Word Addressing Mode
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
Description src1 is subtracted from src2 using the word addressing mode specified for src2. The
subtraction defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the
mode can be changed to circular mode by writing the appropriate value to the AMR
(see ‘‘Addressing Mode Register (AMR)’’ on page 2-12). src1 is left shifted by 2. The
result is placed in dst.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
A3 xxxxxxxxh A3 00000108h
A5 00000100h A5 00000100h
1. BK0 = 3 → size = 16
A5 in circular addressing mode using BK0
4.305 SUBC
Subtract Conditionally and Shift—Used for Division
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 1 0 1 1 1 1 0 s p
5 1 1 1
Description Subtract src2 from src1. If the result is greater than or equal to 0, left shift the result by 1, add 1
to it, and place it in dst. If the result is less than 0, left shift src1 by 1, and place it in dst. This
step is commonly used in division.
Execution if (cond){
if (src1 - src2 ≥ 0), ((src1 - src2) << 1) + 1 → dst
else (src1 << 1) → dst
}
else nop
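To show how this step is used in division, the following Python sketch models one SUBC step and then applies it 16 times to divide two 16-bit unsigned values. It is an illustration under assumptions, not a TI routine: the operands are treated as unsigned 32-bit values, the denominator is pre-shifted left by 15 bits before the loop, and the names subc and divide_u16 are hypothetical.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def subc(src1: int, src2: int) -> int:
    """One conditional-subtract-and-shift step on unsigned 32-bit values."""
    if src1 >= src2:
        return (((src1 - src2) << 1) + 1) & MASK32
    return (src1 << 1) & MASK32


def divide_u16(numerator: int, denominator: int) -> Tuple[int, int]:
    """Return (quotient, remainder) using 16 SUBC steps (illustrative)."""
    if not (0 <= numerator <= 0xFFFF and 1 <= denominator <= 0xFFFF):
        raise ValueError("operands must be 16-bit, denominator nonzero")
    den = denominator << 15        # align the denominator for 16 steps
    acc = numerator
    for _ in range(16):
        acc = subc(acc, den)
    # Quotient bits accumulate in the low half, the remainder in the high half.
    return acc & 0xFFFF, acc >> 16


quotient, remainder = divide_u16(7, 2)
```

Each step shifts one quotient bit into the least-significant position, so after 16 steps the low halfword holds the quotient and the high halfword holds the remainder.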
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
Example 2
4.306 SUBDP
Subtract Two Double-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Note—
1) This instruction takes the rounding mode from and sets the warning bits in the
floating-point adder configuration register (FADCR), not the floating-point
auxiliary configuration register (FAUCR) as for other .S unit instructions.
2) The source-specific warning bits set in FADCR are set according to the register
sources in the actual machine instruction and not according to the order of the
sources in the assembly form.
3) If rounding is performed, the INEX bit is set.
4) If one source is SNaN or QNaN, the result is NaN_out. If either source is SNaN,
the INVAL bit is set also.
5) If both sources are +infinity or −infinity, the result is NaN_out and the INVAL
bit is set.
6) If one source is signed infinity and the other source is anything except NaN or
signed infinity of the same sign, the result is signed infinity and the INFO bit is
set.
7) If overflow occurs, the INEX and OVER bits are set and the results are set as
follows (LFPN is the largest floating-point number):
8) If underflow occurs, the INEX and UNDER bits are set and the results are set
as follows (SPFN is the smallest floating-point number):
9) If the sources are equal numbers of the same sign, the result is +0 unless the
rounding mode is −infinity, in which case the result is −0.
10) If the sources are both 0 with opposite signs or both denormalized with
opposite signs, the sign of the result is the same as the sign of src1.
11) A signed denormalized source is treated as a signed 0 and the DENn bit is set.
If the other source is not NaN or signed infinity, the INEX bit is also set.
Execution if (cond)src1 - src2 → dst
else nop
Pipeline
Pipeline Stage   E1               E2               E3  E4  E5  E6     E7
Read             src1_l, src2_l   src1_h, src2_h
Written                                                        dst_l  dst_h
Unit in use      .L or .S         .L or .S
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP,
MPYSPDP, MPYSP2DP, or SUBDP instruction, the number of delay slots can be
reduced by one, because these instructions read the lower word of the DP source one
cycle before the upper word of the DP source.
Instruction Type ADDDP/SUBDP
Delay Slots 6
B1:B0 4021 3333h 3333 3333h B1:B0 4021 3333h 3333 3333h 8.6
A3:A2 C004 0000h 0000 0000h A3:A2 C004 0000h 0000 0000h -2.5
A5:A4 xxxx xxxxh xxxx xxxxh A5:A4 4026 3333h 3333 3333h 11.1
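The register-pair values in this example can be checked against the IEEE 754 double-precision encoding with a short Python sketch. It runs on the host's floating-point arithmetic (round to nearest), which is an assumption; on the DSP the rounding mode comes from FADCR. The helper name dp_to_pair is hypothetical.

```python
import struct
from typing import Tuple


def dp_to_pair(value: float) -> Tuple[int, int]:
    """Return (high word, low word) of an IEEE 754 double, as in a register pair."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits >> 32, bits & 0xFFFFFFFF


src1 = dp_to_pair(8.6)            # B1:B0 in the example
src2 = dp_to_pair(-2.5)           # A3:A2 in the example
dst = dp_to_pair(8.6 - (-2.5))    # A5:A4 after the subtraction
```

The three pairs come out as 4021 3333h 3333 3333h, C004 0000h 0000 0000h, and 4026 3333h 3333 3333h, which correspond to 8.6, -2.5, and 11.1.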
4.307 SUBSP
Subtract Two Single-Precision Floating-Point Values
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Note—
1) This instruction takes the rounding mode from and sets the warning bits in the
floating-point adder configuration register (FADCR), not the floating-point
auxiliary configuration register (FAUCR) as for other .S unit instructions.
2) The source-specific warning bits set in FADCR are set according to the register
sources in the actual machine instruction and not according to the order of the
sources in the assembly form.
3) If rounding is performed, the INEX bit is set.
4) If one source is SNaN or QNaN, the result is NaN_out. If either source is SNaN,
the INVAL bit is set also.
5) If both sources are +infinity or −infinity, the result is NaN_out and the INVAL
bit is set.
6) If one source is signed infinity and the other source is anything except NaN or
signed infinity of the same sign, the result is signed infinity and the INFO bit is
set.
7) If overflow occurs, the INEX and OVER bits are set and the results are set as
follows (LFPN is the largest floating-point number):
8) If underflow occurs, the INEX and UNDER bits are set and the results are set
as follows (SPFN is the smallest floating-point number):
9) If the sources are equal numbers of the same sign, the result is +0 unless the
rounding mode is −infinity, in which case the result is −0.
10) If the sources are both 0 with opposite signs or both denormalized with
opposite signs, the sign of the result is the same as the sign of src1.
11) A signed denormalized source is treated as a signed 0 and the DENn bit is set.
If the other source is not NaN or signed infinity, the INEX bit is also set.
Execution if (cond)src1 - src2 → dst
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .L or .S
Delay Slots 3
4.308 SUBU
Subtract Two Unsigned Integers Without Saturation
or
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
4.309 SUB2
Subtract Two 16-Bit Integers on Upper and Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 0 0 1 0 0 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 0 1 1 0 0 0 s p
5 1 1 1
Opcode .D unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 0 0 1 0 1 1 1 0 0 s p
5 1 1 1
Description The upper and lower halves of src2 are subtracted from the upper and lower halves of
src1 and the result is placed in dst. Any borrow from the lower-half subtraction does
not affect the upper-half subtraction. Specifically, the upper-half of src2 is subtracted
from the upper-half of src1 and placed in the upper-half of dst. The lower-half of src2
is subtracted from the lower-half of src1 and placed in the lower-half of dst.
31 16 15 0
a_hi a_lo ←src1
- -
SUB2
b_hi b_lo ←src2
= =
31 16 15 0
a_hi - b_hi a_lo - b_lo ←dst
Note—Unlike the SUB instruction, the argument ordering on the .D unit form
of SUB2 is consistent with the argument ordering for the .L and .S unit forms.
Execution if (cond){
(lsb16(src1) - lsb16(src2)) → lsb16(dst);
(msb16(src1) - msb16(src2)) → msb16(dst)
}
else nop
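The independence of the two halves can be seen in a small Python model. This is a sketch for illustration only; the function name sub2 and the operand values are arbitrary and do not come from this guide.

```python
def sub2(src1: int, src2: int) -> int:
    """Subtract the 16-bit halves of src2 from the matching halves of src1."""
    lo = ((src1 & 0xFFFF) - (src2 & 0xFFFF)) & 0xFFFF
    hi = (((src1 >> 16) & 0xFFFF) - ((src2 >> 16) & 0xFFFF)) & 0xFFFF
    # The lower-half borrow is discarded by the mask and never reaches hi.
    return (hi << 16) | lo


packed = sub2(0x00050001, 0x00020003)
# A plain 32-bit subtraction of the same operands, for comparison.
plain = (0x00050001 - 0x00020003) & 0xFFFFFFFF
```

Here the lower half underflows (1 - 3). The packed result keeps the upper half at 0003h, whereas a full 32-bit subtraction would borrow from it and give 0002h.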
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, .D
Delay Slots 0
Examples Example 1
Example 2
Example 3
4.310 SUB4
Subtract Without Saturation, Four 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 1 1 0 0 1 1 0 1 1 0 s p
5 1 1 1
Description Performs 2s-complement subtraction between packed 8-bit quantities. The values in
src1 and src2 are treated as packed 8-bit data and the results are written into dst in a
packed 8-bit format.
For each pair of 8-bit values in src1 and src2, the difference between the 8-bit value from
src1 and the 8-bit value from src2 is calculated to produce an 8-bit result. No saturation
is performed. The result is placed in the corresponding position in dst:
• The difference between src1 byte0 and src2 byte0 is placed in byte0 of dst.
• The difference between src1 byte1 and src2 byte1 is placed in byte1 of dst.
• The difference between src1 byte2 and src2 byte2 is placed in byte2 of dst.
• The difference between src1 byte3 and src2 byte3 is placed in byte3 of dst.
31 24 23 16 15 8 7 0
a_3 a_2 a_1 a_0 ←src1
- - - -
SUB4
b_3 b_2 b_1 b_0 ←src2
= = = =
31 24 23 16 15 8 7 0
a_3 - b_3 a_2 - b_2 a_1 - b_1 a_0 - b_0 ←dst
Execution if (cond){
(byte0(src1) - byte0(src2)) → byte0(dst);
(byte1(src1) - byte1(src2)) → byte1(dst);
(byte2(src1) - byte2(src2)) → byte2(dst);
(byte3(src1) - byte3(src2)) → byte3(dst)
}
else nop
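A Python model of the per-byte subtraction makes the absence of saturation visible. This is a minimal sketch, not TI code; the function name sub4 and the operand values are chosen only for illustration.

```python
def sub4(src1: int, src2: int) -> int:
    """Per-byte 2s-complement subtraction of packed 8-bit data, no saturation."""
    dst = 0
    for lane in range(4):
        shift = 8 * lane
        diff = ((src1 >> shift) & 0xFF) - ((src2 >> shift) & 0xFF)
        # Keeping only the low 8 bits wraps the result instead of saturating.
        dst |= (diff & 0xFF) << shift
    return dst


result = sub4(0x10FF0380, 0x20010580)
```

For these operands, byte3 (10h - 20h) and byte1 (03h - 05h) wrap around to F0h and FEh rather than clamping, giving F0FEFE00h.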
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
4.311 SWAP2
Swap Bytes in Upper and Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18 17
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 0 1 1 1 1 0 s p
5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17
creg z dst src2 src1
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 0 0 0 0 1 0 0 0 s p
5 1 1 1
Description The SWAP2 pseudo-operation takes the lower halfword from src2 and places it in the
upper halfword of dst, while the upper halfword from src2 is placed in the lower
halfword of dst.
31 16 15 0
b_hi b_lo ←src2
SWAP2
31 16 15 0
b_lo b_hi ←dst
The SWAP2 instruction can be used in conjunction with the SWAP4 instruction (see
SWAP4) to change the byte ordering (and therefore, the endianness) of 32-bit data.
Execution if (cond){
msb16(src2) → lsb16(dst);
lsb16(src2) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
Example 2
4.312 SWAP4
Swap Byte Pairs in Upper and Lower Register Halves
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 1 1 0 1 0 1 1 0 s p
1 1 1
Description Exchanges pairs of bytes within each halfword of src2, placing the result in dst. The
values in src2 are treated as unsigned, packed 8-bit values.
Specifically, the upper byte in the upper halfword is placed in the lower byte in the upper
halfword, while the lower byte of the upper halfword is placed in the upper byte of the
upper halfword. Also, the upper byte in the lower halfword is placed in the lower byte
of the lower halfword, while the lower byte in the lower halfword is placed in the upper
byte of the lower halfword.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ←src2
SWAP4
31 24 23 16 15 8 7 0
ub_2 ub_3 ub_0 ub_1 ←dst
By itself, this instruction changes the ordering of bytes within halfwords. This
effectively changes the endianness of 16-bit data packed in 32-bit words. The endianness
of full 32-bit quantities can be changed by using the SWAP4 instruction in conjunction
with the SWAP2 instruction (see SWAP2).
Execution if (cond){
ubyte0(src2) → ubyte1(dst);
ubyte1(src2) → ubyte0(dst);
ubyte2(src2) → ubyte3(dst);
ubyte3(src2) → ubyte2(dst)
}
else nop
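The combined use of SWAP4 and SWAP2 to reverse the byte order of a 32-bit word can be sketched in Python. The functions swap4 and swap2 below are hypothetical models of the two instructions as described in this guide, and the test word is arbitrary.

```python
def swap4(src2: int) -> int:
    """Exchange the two bytes inside each halfword."""
    return ((src2 & 0x00FF00FF) << 8) | ((src2 >> 8) & 0x00FF00FF)


def swap2(src2: int) -> int:
    """Exchange the upper and lower halfwords."""
    return ((src2 & 0xFFFF) << 16) | ((src2 >> 16) & 0xFFFF)


word = 0x11223344
halves_swapped = swap4(word)              # bytes swapped within each halfword
byte_reversed = swap2(halves_swapped)     # halfwords swapped as well
```

Applying swap4 alone gives 22114433h; following it with swap2 gives 44332211h, the fully byte-reversed word. The two steps can be applied in either order.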
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
4.313 SWE
Software Exception
Syntax SWE
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description Causes an internal exception to be taken. It can be used as a mechanism for User mode
programs to request Supervisor mode services. Execution of the SWE instruction
results in an exception being recognized in the E1 pipeline phase containing the SWE
instruction. The SXF bit in EFR is set to 1. The HWE bit in NTSR is cleared to 0. If
exceptions have been globally enabled, this causes an exception to be recognized before
execution of the next execute packet. The address of that next execute packet is placed
in NRP.
Delay Slots 0
4.314 SWENR
Software Exception—No Return
Syntax SWENR
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description Causes an internal exception to be taken. It is intended for use in systems supporting a
secure operating mode. It can be used as a mechanism for User mode programs to
request Supervisor mode services. It differs from the SWE instruction in four ways:
1. TSR is not copied into NTSR.
2. No return address is placed in NRP (it remains unmodified).
3. The IB bit in TSR is set to 1. This will be observable only in the case where
another exception is recognized simultaneously.
4. A branch to REP (restricted entry point register) is forced in the context switch
rather than the ISTP-based exception (NMI) vector.
This instruction executes unconditionally.
Delay Slots 0
4.315 UNPKBU4
Unpack All Unsigned Packed 8-bit to Unsigned Packed 16-bit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 opfield x 0 0 1 1 0 1 0 1 1 0 s p
3 5 5 5
Description The UNPKBU4 instruction unpacks the unsigned bytes of xop into the half-words of
dwdst.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
Instruction Type Single cycle
Delay Slots 0
Example A3 == 0xaabb778d
UNPKBU4 .S A3,A2:A1
A2 == 0x00aa00bb
A1 == 0x0077008d
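The example values can be reproduced with a short Python model. This is an illustrative sketch, not TI code: the function name unpkbu4 is hypothetical, and the result is returned as a (high word, low word) tuple standing in for the destination register pair.

```python
from typing import Tuple


def unpkbu4(src: int) -> Tuple[int, int]:
    """Zero-extend the four bytes of src into four 16-bit halfwords."""
    b = [(src >> (8 * i)) & 0xFF for i in range(4)]
    low = (b[1] << 16) | b[0]     # bytes 1 and 0 fill the low word
    high = (b[3] << 16) | b[2]    # bytes 3 and 2 fill the high word
    return high, low


high_word, low_word = unpkbu4(0xAABB778D)
```

With the source value from the example, the high word is 00AA00BBh and the low word is 0077008Dh.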
4.316 UNPKH2
Unpack High Signed Packed 16-bit to Packed 32-bit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
3 5 5 5
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 opfield x 0 0 1 1 0 1 0 1 1 0 s p
3 5 5 5
Description The UNPKH2 instruction extracts the 2 signed 16-bit integers in xop and expands each
to a 32-bit value.
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
Execution if(cond) {
slsb16(src1) -> dst_e
smsb16(src1) -> dst_o
}
else nop
Instruction Type Single cycle
Delay Slots 0
Example A3 == 0x82c47688
UNPKH2 .S A3,A1:A0
A1 == 0xffff82c4
A0 == 0x00007688
4.317 UNPKHU2
Unpack High Unsigned Packed 16-bit to Packed 32-bit
31-29  28  27-23  22-18  17-13    12  11-2                 1  0
creg   z   dst    src2   opfield  x   0 0 1 1 0 1 0 1 1 0  s  p
3      1   5      5      5        1   10                   1  1
Description The UNPKHU2 instruction extracts the 2 unsigned 16-bit integers in xop and expands
each to a 32-bit value.
Execution if(cond) {
ulsb16(src1) -> lsb16(dst_e)
0 -> msb16(dst_e)
umsb16(src1) -> lsb16(dst_o)
0 -> msb16(dst_o)
}
else nop
Delay Slots 0
Functional Unit Latency 1
Example A3 == 0x82c47688
UNPKHU2 .S A3,A1:A0
A1 == 0x000082c4
A0 == 0x00007688
A3 == 0xffffffff
UNPKHU2 .S A3,A1:A0
A1 == 0x0000ffff
A0 == 0x0000ffff
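The difference between UNPKH2 (sign extension) and UNPKHU2 (zero extension) can be sketched in C as follows. The function names and the 64-bit return convention (dst_o in bits 63-32, dst_e in bits 31-0) are assumptions made for illustration only.

```c
#include <stdint.h>

/* Sign-extend a 16-bit value held in the low half of h to 32 bits. */
uint32_t sext16(uint32_t h)
{
    h &= 0xffffu;
    return (h & 0x8000u) ? (h | 0xffff0000u) : h;
}

/* Hypothetical model of UNPKH2: signed halfwords are sign-extended. */
uint64_t unpkh2_model(uint32_t src)
{
    uint32_t dst_o = sext16(src >> 16);  /* upper halfword */
    uint32_t dst_e = sext16(src);        /* lower halfword */
    return ((uint64_t)dst_o << 32) | dst_e;
}

/* Hypothetical model of UNPKHU2: unsigned halfwords are zero-extended. */
uint64_t unpkhu2_model(uint32_t src)
{
    uint32_t dst_o = src >> 16;
    uint32_t dst_e = src & 0xffffu;
    return ((uint64_t)dst_o << 32) | dst_e;
}
```

For the source value 0x82c47688 used in both examples, the signed model yields 0xffff82c4 and 0x00007688, while the unsigned model yields 0x000082c4 and 0x00007688.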
4.318 UNPKHU4
Unpack 16 MSB Into Two Lower 8-Bit Halfwords of Upper and Lower Register Halves
Opcode .L unit
31-29  28  27-23  22-18  17-13      12  11-2                 1  0
creg   z   dst    src2   0 0 0 1 1  x   0 0 1 1 0 1 0 1 1 0  s  p
3      1   5      5      5          1   10                   1  1
Opcode .S unit
31-29  28  27-23  22-18  17-13      12  11-2                 1  0
creg   z   dst    src2   0 0 0 1 1  x   1 1 1 1 0 0 1 0 0 0  s  p
3      1   5      5      5          1   10                   1  1
Description Moves the two most-significant bytes of src2 into the two low bytes of the two
halfwords of dst.
Specifically, the upper byte of the upper halfword of src2 is placed in the lower byte of
the upper halfword of dst, while the lower byte of the upper halfword of src2 is placed in
the lower byte of the lower halfword of dst. The src2 bytes are zero-extended when
unpacked, filling the two high bytes of the two halfwords of dst with zeros.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ←src2
UNPKHU4
31 24 23 16 15 8 7 0
00000000 ub_3 00000000 ub_2 ←dst
Execution if (cond){
ubyte3(src2) → ubyte2(dst);
0 → ubyte3(dst);
ubyte2(src2) → ubyte0(dst);
0 → ubyte1(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
A1 9E 52 6E 30h A1 9E 52 6E 30h
Example 2
4.319 UNPKLU4
Unpack 16 LSB Into Two Lower 8-Bit Halfwords of Upper and Lower Register Halves
Opcode .L unit
31-29  28  27-23  22-18  17-13      12  11-2                 1  0
creg   z   dst    src2   0 0 0 1 0  x   0 0 1 1 0 1 0 1 1 0  s  p
3      1   5      5      5          1   10                   1  1
Opcode .S unit
31-29  28  27-23  22-18  17-13      12  11-2                 1  0
creg   z   dst    src2   0 0 0 1 0  x   1 1 1 1 0 0 1 0 0 0  s  p
3      1   5      5      5          1   10                   1  1
Description Moves the two least-significant bytes of src2 into the two low bytes of the two halfwords
of dst.
Specifically, the upper byte of the lower halfword of src2 is placed in the lower byte of
the upper halfword of dst, while the lower byte of the lower halfword of src2 is kept in
the lower byte of the lower halfword of dst. The src2 bytes are zero-extended when
unpacked, filling the two high bytes of the two halfwords of dst with zeros.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ←src2
UNPKLU4
31 24 23 16 15 8 7 0
00000000 ub_1 00000000 ub_0 ←dst
Execution if (cond){
ubyte0(src2) → ubyte0(dst);
0 → ubyte1(dst);
ubyte1(src2) → ubyte2(dst);
0 → ubyte3(dst);
}
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L, .S
Delay Slots 0
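The Execution pseudocode for UNPKLU4, and for UNPKHU4 in section 4.318, can be modeled in C as a rough sketch. The function names are hypothetical; only the byte movement follows the descriptions above.

```c
#include <stdint.h>

/* Hypothetical model of UNPKHU4: bytes 3 and 2 of src2 move to
   bytes 2 and 0 of dst; bytes 3 and 1 of dst are zero. */
uint32_t unpkhu4_model(uint32_t src2)
{
    uint32_t ub3 = (src2 >> 24) & 0xffu;
    uint32_t ub2 = (src2 >> 16) & 0xffu;
    return (ub3 << 16) | ub2;
}

/* Hypothetical model of UNPKLU4: bytes 1 and 0 of src2 move to
   bytes 2 and 0 of dst; bytes 3 and 1 of dst are zero. */
uint32_t unpklu4_model(uint32_t src2)
{
    uint32_t ub1 = (src2 >> 8) & 0xffu;
    uint32_t ub0 = src2 & 0xffu;
    return (ub1 << 16) | ub0;
}
```

Applied to the value 9E526E30h, the UNPKHU4 model gives 009E0052h and the UNPKLU4 model gives 006E0030h.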
Examples Example 1
A1 9E 52 6E 30h A1 9E 52 6E 30h
Example 2
4.320 XOR
Bitwise Exclusive OR
Opcode Opcode for .L Unit, 1/2 src - same as L2S but fixed hdr, bit 23-msb of opcode
31-28    27-23  22-18  17-13  12  11-5    4-2    1  0
0 0 0 1  dst    src2   src1   x   opcode  1 1 0  s  p
4        5      5      5      1   7       3      1  1
31-28    27-23  22-18  17-13  12  11-6    5-2      1  0
0 0 0 1  dst    src2   src1   x   opcode  1 0 0 0  s  p
4        5      5      5      1   6       4        1  1
Description The XOR instruction performs a bitwise XOR between the two source operands and
stores the result in the destination register. The constant form of XOR, with a source
constant of -1, produces the one's complement of a number. XORing a register with
itself clears the register.
Delay Slots 0
See Also
Example A0 == 0x05af0137
XOR .L -1,A0,A15 ; Invert A0 (1's complement)
A15 == 0xfa50fec8
A0 == 0xbe10fa31
A1 == 0x00ff00ff
XOR .L A0,A1,A15
A15 == 0xbeefface
A1 == 0x05af0137
A0 == 0x05af0137
XOR .L -1,A1:A0,A15:A14 ; Invert A1:A0 (1's complement)
A15 == 0xfa50fec8
A14 == 0xfa50fec8
A1 == 0xbe10fa31
A0 == 0xbe10fa31
A3 == 0x00ff00ff
A2 == 0x00ff00ff
XOR .L A1:A0,A3:A2,A15:A14
A15 == 0xbeefface
A14 == 0xbeefface
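The two idioms mentioned in the description, inverting a value with a constant of -1 and clearing a value by XORing it with itself, can be checked with a small C sketch. The helper names are hypothetical and model only the 32-bit register form.

```c
#include <stdint.h>

/* Hypothetical model of the 32-bit XOR instruction. */
uint32_t xor_model(uint32_t src1, uint32_t src2)
{
    return src1 ^ src2;
}

/* XOR with the constant -1 (all ones) gives the one's complement. */
uint32_t ones_complement(uint32_t x)
{
    return xor_model(0xffffffffu, x);
}

/* XOR of a value with itself always gives zero. */
uint32_t clear_value(uint32_t x)
{
    return xor_model(x, x);
}
```

Using the register values from the examples, ones_complement(0x05af0137) is 0xfa50fec8 and xor_model(0xbe10fa31, 0x00ff00ff) is 0xbeefface.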
4.321 XORMPY
Galois Field Multiply With Zero Polynomial
Opcode
31 30 29 28 27 23 22 18 17
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 1 1 0 1 1 1 1 0 0 s p
5 1 1 1
Description Performs a Galois field multiply, where src1 is 32 bits and src2 is limited to 9 bits. This
multiply connects all levels of the gmpy4 together and only extends out by 8 bits. The
XORMPY instruction is identical to a GMPY instruction executed with a zero-value
polynomial.
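typedef unsigned int uint;
// Function header reconstructed; the name xormpy is illustrative.
uint xormpy(uint src1, uint src2)
{
    uint pp;
    uint mask;
    uint i;
    pp = 0;
    mask = 0x00000100;                 // multiply by computing partial
                                       // products, starting at bit 8 of src2.
    for ( i=0; i<8; i++ ){
        if ( src2 & mask ) pp ^= src1; // add (XOR) src1 for each set bit
        mask >>= 1;
        pp <<= 1;
    }
    if ( src2 & 0x1 ) pp ^= src1;      // bit 0 is added without a shift
    return (pp) ;                      // leave it asserted left.
}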
Execution GMPY_poly = 0
(lsb9(src2) gmpy uint(src1)) → uint(dst)
Delay Slots 3
Example Before instruction: A0 12345678h, A1 00000126h
After instruction: A2 1E654210h
4.322 XPND2
Expand Bits to Packed 16-Bit Masks
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 1 x 0 0 0 0 1 1 1 1 0 0 s p
1 1 1
Description Reads the two least-significant bits of src2 and expands them into two halfword masks
written to dst. Bit 1 of src2 is replicated and placed in the upper halfword of dst. Bit 0 of
src2 is replicated and placed in the lower halfword of dst. Bits 2 through 31 of src2 are
ignored.
31 24 23 16 15 8 7 0
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXX10 ←src2
XPND2
31 24 23 16 15 8 7 0
11111111 11111111 00000000 00000000 ←dst
The XPND2 instruction is useful, when combined with the output of the CMPGT2 or
CMPEQ2 instruction, for generating a mask that corresponds to the individual
halfword positions that were compared. That mask may then be used with ANDN,
AND, or OR instructions to perform other operations like compositing. This is an
example:
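CMPGT2 .S1 A3, A4, A5 ; Compare two registers, both upper
                      ; and lower halves.
XPND2 .M1 A5, A2      ; Expand the compare results into two halfword masks.
NOP                   ; Wait one delay slot for the XPND2 result.
AND .D1 A2, A7, A8    ; Apply the mask to a value to create result.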
Because the XPND2 instruction only examines the two least-significant bits of src2, it
is possible to store a large bit mask in a single 32-bit word and expand it using multiple
SHR and XPND2 instruction pairs. This can be useful for expanding a packed
1-bit-per-pixel bitmap into full 16-bit pixels in imaging applications.
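The bitmap expansion described above can be sketched in C. This is a minimal model under stated assumptions: xpnd2_model and expand_bitmap16 are hypothetical names, and the choice to consume pixels starting from the least-significant bits is an assumption, since the text does not fix a pixel order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of XPND2: bit 1 fills the upper halfword,
   bit 0 fills the lower halfword; bits 2-31 are ignored. */
uint32_t xpnd2_model(uint32_t src2)
{
    uint32_t hi = (src2 & 2u) ? 0xffff0000u : 0u;
    uint32_t lo = (src2 & 1u) ? 0x0000ffffu : 0u;
    return hi | lo;
}

/* Expand 32 one-bit pixels into 16 words of two 16-bit masks each,
   mirroring a sequence of XPND2 and SHR instruction pairs. */
void expand_bitmap16(uint32_t bits, uint32_t out[16])
{
    for (size_t i = 0; i < 16; i++) {
        out[i] = xpnd2_model(bits);  /* XPND2: expand the low two bits */
        bits >>= 2;                  /* SHR: expose the next two bits  */
    }
}
```

Each output word can then be ANDed with a pixel pair to keep or discard the two 16-bit pixels it covers.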
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
Example 2
4.323 XPND4
Expand Bits to Packed 8-Bit Masks
Opcode
31 29 28 27 23 22 18 17 16
3 1 5 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 0 0 1 1 1 1 0 0 s p
1 1 1
Description Reads the four least-significant bits of src2 and expands them into four-byte masks
written to dst. Bit 0 of src2 is replicated and placed in the least-significant byte of dst.
Bit 1 of src2 is replicated and placed in second least-significant byte of dst. Bit 2 of src2
is replicated and placed in second most-significant byte of dst. Bit 3 of src2 is replicated
and placed in most-significant byte of dst. Bits 4 through 31 of src2 are ignored.
31 24 23 16 15 8 7 0
XXXXXXXX XXXXXXXX XXXXXXXX XXXX1001 ←src2
XPND4
31 24 23 16 15 8 7 0
11111111 00000000 00000000 11111111 ←dst
The XPND4 instruction is useful, when combined with the output of the CMPGT4 or
CMPEQ4 instruction, for generating a mask that corresponds to the individual byte
positions that were compared. That mask may then be used with ANDN, AND, or OR
instructions to perform other operations like compositing.
This is an example:
CMPEQ4 .S1 A3, A4, A5 ; Compare two 32-bit registers, all four bytes.
NOP
Because the XPND4 instruction only examines the four least-significant bits of src2, it
is possible to store a large bit mask in a single 32-bit word and expand it using multiple
SHR and XPND4 instruction pairs. This can be useful for expanding a packed,
1-bit-per-pixel bitmap into full 8-bit pixels in imaging applications.
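The mask generation and the compositing step mentioned earlier (combining the mask with AND, ANDN, and OR) can be sketched in C. The names xpnd4_model and select_bytes are hypothetical, and the per-byte select is one possible compositing operation rather than a prescribed sequence.

```c
#include <stdint.h>

/* Hypothetical model of XPND4: bit i of src2 (i = 0..3) fills
   byte i of the result; bits 4-31 are ignored. */
uint32_t xpnd4_model(uint32_t src2)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < 4; i++) {
        if (src2 & (1u << i))
            mask |= 0xffu << (8 * i);
    }
    return mask;
}

/* Per-byte select: take the byte from a where the mask byte is
   0xff, otherwise from b (the AND, ANDN, OR pattern). */
uint32_t select_bytes(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}
```

For the bit pattern 1001 shown in the diagram above, the model produces the mask 0xff0000ff, so select_bytes keeps the outer two bytes of a and the inner two bytes of b.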
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
Example 2
4.324 ZERO
Zero a Register
or
Opcode
Opcode map field used... For operand type... Unit Opfield
dst sint .L1, .L2 0010111
dst slong .L1, .L2 0110111
dst sint .D1, .D2 010001
dst sint .S1, .S2 010111
Description This is a pseudo-operation used to fill the destination register or register pair with 0s.
When the destination is a single register, the assembler uses the MVK instruction to
load it with zeros: MVK (.unit) 0, dst
When the destination is a register pair, the assembler uses the SUB instruction to
subtract a value from itself and store the result in the destination pair.
Execution if (cond) 0 → dst
else nop
or
Delay Slots 0
Examples Example 1
ZERO .D1 A1
Example 2
Chapter 5
Pipeline
This chapter starts with a description of the pipeline flow. Highlights are:
• The pipeline can dispatch eight parallel instructions every cycle.
• Parallel instructions proceed simultaneously through each pipeline phase.
• Serial instructions proceed through the pipeline with a fixed relative phase
difference between instructions.
• Load and store addresses appear on the CPU boundary during the same pipeline
phase, eliminating read-after-write memory conflicts.
All instructions require the same number of pipeline phases for fetch and decode, but
require a varying number of execute phases. This chapter contains a description of the
number of execution phases for each type of instruction.
Finally, this chapter contains performance considerations for the pipeline. These
considerations include the occurrence of fetch packets that contain multiple execute
packets, execute packets that contain multicycle NOPs, and memory considerations for
the pipeline. For more information about fully optimizing a program and taking full
advantage of the pipeline, see the TMS320C6000 Programmer's Guide (SPRU198).
5.1 Pipeline Operation Overview
5.1.1 Fetch
The fetch phases of the pipeline are:
• PG: Program address generate
• PS: Program address send
• PW: Program access ready wait
• PR: Program fetch packet receive
The DSP uses a fetch packet (FP) of eight words. All eight of the words proceed through
fetch processing together, through the PG, PS, PW, and PR phases. Figure 5-2(a) shows
the fetch phases in sequential order from left to right. Figure 5-2(b) is a functional
diagram of the flow of instructions through the fetch phases. During the PG phase, the
program address is generated in the CPU. In the PS phase, the program address is sent
to memory. In the PW phase, a memory read occurs. Finally, in the PR phase, the fetch
packet is received at the CPU. Figure 5-2(c) shows fetch packets flowing through the
phases of the fetch stage of the pipeline. In Figure 5-2(c), the first fetch packet (in PR)
is made up of four execute packets, and the second and third fetch packets (in PW and
PS) contain two execute packets each. The last fetch packet (in PG) contains a single
execute packet of eight instructions.
[Figure 5-2: fetch phases of the pipeline (PG, PS, PW, PR), panels (a), (b), and (c).]
5.1.2 Decode
The decode phases of the pipeline are:
• DP: Instruction dispatch
• DC: Instruction decode
In the DP phase of the pipeline, the fetch packets are split into execute packets. Execute
packets consist of one instruction or from two to eight parallel instructions. During the
DP phase, the instructions in an execute packet are assigned to the appropriate
functional units. In the DC phase, the source registers, destination registers, and
associated paths are decoded for the execution of the instructions in the functional
units.
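The split of a fetch packet into execute packets can be sketched in C. This sketch assumes the usual C6000 convention that the p bit (bit 0 in the opcode maps) chains an instruction to the next one when it is 1. It also assumes, for simplicity, that an execute packet ends at the fetch packet boundary. The function name and the array layout are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define FP_WORDS 8  /* a fetch packet holds eight 32-bit words */

/* Writes the size of each execute packet into sizes[] and returns
   how many execute packets the fetch packet contains. */
size_t split_fetch_packet(const uint32_t fp[FP_WORDS], size_t sizes[FP_WORDS])
{
    size_t count = 0;
    size_t run = 0;

    for (size_t i = 0; i < FP_WORDS; i++) {
        run++;
        /* p = 1 chains this word to the next, except on the last word. */
        int chained = (fp[i] & 1u) && (i + 1 < FP_WORDS);
        if (!chained) {
            sizes[count++] = run;
            run = 0;
        }
    }
    return count;
}
```

A fetch packet whose p bits read 1, 0, 1, 1, 1, 1, 1, 0 splits into an execute packet of two instructions followed by one of six.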
Figure 5-3(a) shows the decode phases in sequential order from left to right.
Figure 5-3(b) shows a fetch packet that contains two execute packets as they are
processed through the decode stage of the pipeline. The last six instructions of the fetch
packet (FP) are parallel and form an execute packet (EP). This EP is in the dispatch
phase (DP) of the decode stage. The arrows indicate each instruction's assigned
functional unit for execution during the same cycle. The NOP instruction in the eighth
slot of the FP is not dispatched to a functional unit because there is no execution
associated with it.
The first two slots of the fetch packet (shaded below) represent an execute packet of two
parallel instructions that were dispatched on the previous cycle. This execute packet
contains two MPY instructions that are now in decode (DC) one cycle before
execution. There are no instructions decoded for the .L, .S, and .D functional units for
the situation illustrated.
[Figure 5-3(b): the execute packet ADD, ADD, STW, STW, ADDK, NOP is in DP and the execute packet MPYH, MPYH is in DC; the functional units are .L1, .S1, .M1, .D1, .D2, .M2, .S2, and .L2.]
5.1.3 Execute
The execute portion of the pipeline is subdivided into five phases (E1-E5). Different
types of instructions require different numbers of these phases to complete their
execution. These phases of the pipeline play an important role in understanding
the device state at CPU cycle boundaries. The execution of different types of
instructions in the pipeline is described in section 5.2, “Pipeline Execution of
Instruction Types”. Figure 5-4(a) shows the execute phases of the pipeline in sequential
order from left to right. Figure 5-4(b) shows the portion of the functional block diagram
in which execution occurs.
Figure 5-4 Execute Phases of the Pipeline
(a) E1 E2 E3 E4 E5
(b) [Block diagram: the execute packet SADD, B, SMPY, SMPY, STH, SMPYH, SUB, SADD is in E1 on the functional units, which connect to register files A and B, the data address paths, and the L1 data cache control.]
Fetch: PG PS PW PR
Decode: DP DC
Execute: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Figure 5-6 shows an example of the pipeline flow of consecutive fetch packets that
contain eight parallel instructions. In this case, where the pipeline is full, all instructions
in a fetch packet are in parallel and split into one execute packet per fetch packet. The
fetch packets flow in lockstep fashion through each phase of the pipeline.
For example, examine cycle 7 in Figure 5-6. When the instructions from FP n reach E1,
the instructions in the execute packet from FP n + 1 are being decoded. FP n + 2 is in
dispatch while FPs n + 3, n + 4, n + 5, and n + 6 are each in one of four phases of
program fetch. See “Performance Considerations” for additional detail on code flowing
through the pipeline. Table 5-1 summarizes the pipeline phases and what happens in
each phase.
Figure 5-6 Pipeline Operation: One Execute Packet per Fetch Packet
Fetch
Packet  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
n       PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8  E9  E10
n+1         PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8  E9  E10
n+2             PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8  E9
n+3                 PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8
n+4                     PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7
n+5                         PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6
n+6                             PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
n+7                                 PG  PS  PW  PR  DP  DC  E1  E2  E3  E4
n+8                                     PG  PS  PW  PR  DP  DC  E1  E2  E3
n+9                                         PG  PS  PW  PR  DP  DC  E1  E2
n+10                                            PG  PS  PW  PR  DP  DC  E1
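The lockstep flow in Figure 5-6 follows a simple rule: fetch packet n + k enters PG in cycle k + 1 and advances one phase per cycle. A minimal C sketch of that rule is shown below; the function name phase_at and the use of NULL for "not in the pipeline" are assumptions made for illustration.

```c
#include <stddef.h>

#define NUM_PHASES 16

/* Phase names in pipeline order, as listed in Figure 5-6. */
static const char *const PHASES[NUM_PHASES] = {
    "PG", "PS", "PW", "PR", "DP", "DC", "E1", "E2",
    "E3", "E4", "E5", "E6", "E7", "E8", "E9", "E10"
};

/* Returns the phase occupied by fetch packet n + k in the given
   1-based cycle, or NULL if the packet is not in the pipeline. */
const char *phase_at(int k, int cycle)
{
    int idx = cycle - 1 - k;  /* packet n + k starts PG in cycle k + 1 */
    if (k < 0 || idx < 0 || idx >= NUM_PHASES)
        return NULL;
    return PHASES[idx];
}
```

For cycle 7 this reproduces the description above: packet n is in E1, n + 1 in DC, n + 2 in DP, and n + 3 through n + 6 are in PR, PW, PS, and PG.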
Figure 5-7 shows a functional block diagram of the pipeline stages. The pipeline
operation is based on CPU cycles. A CPU cycle is the period during which a particular
execute packet is in a particular pipeline phase. CPU cycle boundaries always occur at
clock cycle boundaries.
Figure 5-7 Pipeline Phases Block Diagram
[Block diagram: the fetch stage feeds eight 32-bit decode slots; the execute packet STH, STH, SADD, SADD, SMPYH, SMPY, SUB, B is in DP; the execute stage connects the functional units to register files A and B, the load and store data paths, and the data cache control.]
As code flows through the pipeline phases, it is processed by different parts of the DSP.
Figure 5-7 shows a full pipeline with a fetch packet in every phase of fetch. One execute
packet of eight instructions is being dispatched at the same time that a 7-instruction
execute packet is in decode. The arrows between DP and DC correspond to the
functional units identified in the code in Example 5-1.
In the DC phase portion of Figure 5-7, one box is empty because a NOP was the eighth
instruction in the fetch packet in DC and no functional unit is needed for a NOP.
Finally, Figure 5-7 shows six functional units processing code during the same cycle of
the pipeline.
Registers used by the instructions in E1 are shaded in Figure 5-7. The multiplexers used
for the input operands to the functional units are also shaded in the figure. The bold
crosspaths are used by the MPY instructions.
Most DSP instructions are single-cycle instructions, which means they have only one
execution phase (E1). A small number of instructions require more than one execute
phase. The types of instructions, each of which requires a different number of
execute phases, are described in section 5.2, “Pipeline Execution of Instruction Types”.
5.2 Pipeline Execution of Instruction Types
The execution of instructions is defined in terms of delay slots. A delay slot is a CPU
cycle that occurs after the first execution phase (E1) of an instruction. Results from
instructions with delay slots are not available until the end of the last delay slot. For
example, a multiply instruction has one delay slot, which means that one CPU cycle
elapses before the results of the multiply are available for use by a subsequent
instruction. However, results are available from other instructions finishing execution
during the same CPU cycle in which the multiply is in a delay slot.
Table 5-2 Execution Stage Length Description for Each Instruction Type - Part A
Execution Phase (1,2) | Single Cycle | 16 × 16 Single Multiply/.M Unit Nonmultiply | Store | Multiply Extensions | Load | Branch
E1 | Compute result and write to register | Read operands and start computations | Compute address | Read operands and start computations | Compute address | Target code in PG (3)
E2 | | Compute result and write to register | Send address and data to memory | | Send address to memory |
E3 | | | Access memory | | Access memory |
E4 | | | | Write results to register | Send data back to CPU |
E5 | | | | | Write data into register |
Delay slots | 0 | 1 | 0 (4) | 3 | 4 (4) | 5 (3)
Functional unit latency | 1 | 1 | 1 | 1 | 1 | 1
1. This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the instruction does not write any results or have any pipeline operation after E1.
2. NOP is not shown and has no operation in any of the execution phases.
3. See section 5.2.6 for more information on branches.
4. See section 5.2.3 and section 5.2.5 for more information on execution and delay slots for stores and loads.
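The delay-slot counts in Table 5-2 determine the first cycle in which a later instruction can use a result. The C sketch below encodes only the Part A values; the enum, the function names, and the convention that cycles are counted by each instruction's E1 phase are assumptions made for illustration.

```c
/* Instruction types from Table 5-2 (Part A). */
enum insn_type {
    SINGLE_CYCLE,
    MULTIPLY_16X16,
    STORE,
    MULTIPLY_EXT,
    LOAD,
    BRANCH
};

/* Delay slots as listed in the "Delay slots" row of Table 5-2. */
int delay_slots(enum insn_type t)
{
    switch (t) {
    case SINGLE_CYCLE:   return 0;
    case MULTIPLY_16X16: return 1;
    case STORE:          return 0;
    case MULTIPLY_EXT:   return 3;
    case LOAD:           return 4;
    case BRANCH:         return 5;
    }
    return -1;  /* unknown type */
}

/* If the producer is in E1 during cycle e1_cycle, a consumer can
   first see the result when its own E1 falls in the returned cycle. */
int first_use_cycle(enum insn_type t, int e1_cycle)
{
    return e1_cycle + delay_slots(t) + 1;
}
```

For example, a multiply in E1 during cycle 10 has its result available to an instruction whose E1 is in cycle 12, and a load in E1 during cycle 10 delivers its data for use in cycle 15.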
Table 5-3 Execution Stage Length Description for Each Instruction Type - Part B
Execution Phase (1,2) | 1-Cycle DP | 3-Cycle | 4-Cycle | INSTDP | DP Compare
E1 | Compute the lower results and write to register | Read sources and start computation | Read sources and start computation | Read sources and start computation | Read lower sources and start computation
E2 | Compute the upper results and write to register | Continue computation | Continue computation | Continue computation | Read upper sources, finish computation, and write results to register
E3 | | Complete computation and write results to register | Continue computation | Continue computation |
E4 | | | Complete computation and write results to register | Continue computation and write lower results to register |
E5 | | | | Complete computation and write upper results to register |
Delay slots | 1 | 2 | 3 | 4 | 1
Functional unit latency | 1 | 1 | 1 | 1 | 1
1. This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the instruction does not write any results or have any pipeline operation after E1.
2. NOP is not shown and has no operation in any of the execution phases.
Table 5-4 Execution Stage Length Description for Each Instruction Type - Part C
Execution Phase (1,2) | ADDDP/SUBDP | MPYI | MPYID | MPYDP
E1 | Read lower sources and start computation | Read sources and start computation | Read sources and start computation | Read lower sources and start computation
E2 | Read upper sources and continue computation | Read sources and continue computation | Read sources and continue computation | Read lower src1 and upper src2 and continue computation
E3 | Continue computation | Read sources and continue computation | Read sources and continue computation | Read lower src2 and upper src1 and continue computation
E4 | Continue computation | Read sources and continue computation | Read sources and continue computation | Read upper sources and continue computation
E5 | Continue computation | Continue computation | Continue computation | Continue computation
E6 | Compute the lower results and write to register | Continue computation | Continue computation | Continue computation
E7 | Compute the upper results and write to register | Continue computation | Continue computation | Continue computation
E8 | | Continue computation | Continue computation | Continue computation
E9 | | Complete computation and write results to register | Continue computation and write lower results to register | Continue computation and write lower results to register
E10 | | | Complete computation and write upper results to register | Complete computation and write upper results to register
Delay slots | 6 | 8 | 9 | 9
Functional unit latency | 2 | 4 | 4 | 4
1. This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the instruction does not write any results or have any pipeline operation after E1.
2. NOP is not shown and has no operation in any of the execution phases.
Table 5-5 Execution Stage Length Description for Each Instruction Type - Part D
Execution Phase (1,2) | MPYSPDP | MPYSP2DP
E1 | Read src1 and lower src2 and start computation | Read sources and start computation
E2 | Read src1 and upper src2 and continue computation | Continue computation
E3 | Continue computation | Continue computation
E4 | Continue computation | Continue computation and write lower results to register
E5 | Continue computation | Complete computation and write upper results to register
E6 | Continue computation and write lower results to register |
E7 | Complete computation and write upper results to register |
Delay slots | 6 | 4
Functional unit latency | 3 | 2
1. This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the instruction does not write any results or have any pipeline operation after E1.
2. NOP is not shown and has no operation in any of the execution phases.
Figure 5-9 shows the single-cycle execution diagram. The operands are read, the
operation is performed, and the results are written to a register, all during E1.
Single-cycle instructions have no delay slots.
Table 5-6 Single-Cycle Instruction Execution
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, .M, or .D
PG PS PW PR DP DC E1
[Figure 5-9 block diagram: the functional unit (.L, .S, .M, or .D) reads its operands from the register file and writes the results back in E1.]
Figure 5-11 shows the operations occurring in the pipeline for a multiply instruction.
In the E1 phase, the operands are read and the multiply begins. In the E2 phase, the
multiply finishes, and the result is written to the destination register. Multiply
instructions have one delay slot. Figure 5-11 also applies to the other .M unit
nonmultiply operations.
Table 5-7 Multiply Instruction Execution
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
PG PS PW PR DP DC E1 E2 (1 delay slot)
[Figure 5-11 block diagram: the .M unit reads its operands from the register file in E1 and writes the results back in E2.]
Figure 5-13 shows the operations occurring in the pipeline phases for a store
instruction. In the E1 phase, the address of the data to be stored is computed. In the E2
phase, the data and destination addresses are sent to data memory. In the E3 phase, a
memory write is performed. The address modification is performed in the E1 stage of
the pipeline. Even though stores finish their execution in the E3 phase of the pipeline,
they have no delay slots. Section 5.2.5 explains in more detail why stores have zero
delay slots.
PG PS PW PR DP DC E1 E2 E3
Figure 5-13 Store Instruction Execution Block Diagram
[Block diagram: the .D unit performs the address modification in E1; the address and data are sent from the register file to the memory controller in E2; memory is written in E3.]
When you perform a load and a store to the same memory location, these rules apply
(i = cycle):
• When a load is executed before a store, the old value is loaded and the new value
is stored.
i LDW
i + 1 STW
• When a store is executed before a load, the new value is stored and the new value
is loaded.
i STW
i + 1 LDW
• When the instructions are executed in parallel, the old value is loaded first and
then the new value is stored, but both occur in the same phase.
i STW
i || LDW
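The three rules above can be summarized in a small C model of one load and one store to the same location. The enum and function names are hypothetical, and the model captures only which value the load observes, not the pipeline timing.

```c
#include <stdint.h>

/* Relative issue order of the load and the store. */
enum ls_order {
    LOAD_FIRST,   /* cycle i: LDW, cycle i + 1: STW */
    STORE_FIRST,  /* cycle i: STW, cycle i + 1: LDW */
    PARALLEL      /* cycle i: STW || LDW            */
};

/* Applies one store of new_value and one load to *mem in the given
   order and returns the value that the load observes. */
uint32_t load_and_store(uint32_t *mem, uint32_t new_value, enum ls_order order)
{
    uint32_t loaded;

    if (order == STORE_FIRST) {
        *mem = new_value;   /* the new value is stored ...  */
        loaded = *mem;      /* ... and the new value loaded */
    } else {
        /* LOAD_FIRST and PARALLEL: the old value is loaded first,
           then the new value is stored. */
        loaded = *mem;
        *mem = new_value;
    }
    return loaded;
}
```

In every case the memory location ends up holding the new value; only the value returned by the load differs.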
Figure 5-15 shows the operations occurring in the pipeline for the multiply extensions.
In the E1 phase, the operands are read and the multiplies begin. In the E4 phase,
the multiplies finish, and the results are written to the destination register. Extended
multiply instructions have three delay slots.
PG PS PW PR DP DC E1 E2 E3 E4 (3 delay slots)
[Figure 5-15 block diagram: the .M unit reads its operands from the register file in E1 and writes the results back in E4.]
Figure 5-17 shows the operations occurring in the pipeline phases for a load. In the E1
phase, the data address pointer is modified in its register. In the E2 phase, the data
address is sent to data memory. In the E3 phase, a memory read at that address is
performed.
Table 5-10 Load Instruction Execution
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR, src
Written baseR dst
Unit in use .D
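PG PS PW PR DP DC E1 E2 E3 E4 E5 (4 delay slots)
[Figure 5-17 block diagram: the .D unit performs the address modification in E1 and sends the address to the memory controller in E2; memory is read in E3; the data returns to the CPU in E4 and is written to the register file in E5.]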
In the E4 stage of a load, the data is received at the CPU core boundary. Finally, in the
E5 phase, the data is loaded into a register. Because data is not written to the register
until E5, load instructions have four delay slots. Because pointer results are written to
the register in E1, there are no delay slots associated with the address modification.
In the following code, pointer results are written to the A4 register in the first execute
phase of the pipeline and data is written to the A3 register in the fifth execute phase.
Because a store takes three execute phases to write a value to memory and a load takes
three execute phases to read from memory, a load following a store accesses the value
placed in memory by that store in the cycle after the store is completed. This is why the
store is considered to have zero delay slots.
Figure 5-19 shows a branch instruction execution block diagram. If a branch is in the
E1 phase of the pipeline (in the .S2 unit in Figure 5-19), its branch target is in the fetch
packet that is in PG during that same cycle (shaded in the figure). Because the branch
target has to wait until it reaches the E1 phase to begin execution, the branch takes five
delay slots before the branch target code executes.
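The five delay slots follow directly from the phase alignment described above. The sketch below assumes one phase per cycle with no stalls; the function names are illustrative only.

```python
# Pipeline phases from program fetch through the first execute phase.
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1"]


def branch_target_e1_cycle(branch_e1_cycle):
    """Cycle in which the branch target reaches E1.

    The target's fetch packet is in PG during the cycle in which the branch
    is in E1, and it advances one phase per cycle (no stalls assumed).
    """
    phases_to_advance = PHASES.index("E1") - PHASES.index("PG")
    return branch_e1_cycle + phases_to_advance


def branch_delay_slots():
    """Cycles strictly between the branch's E1 and the target's E1."""
    return branch_target_e1_cycle(0) - 0 - 1
```

With the branch in E1 during cycle 1, the target reaches E1 in cycle 7, so cycles 2 through 6 are the delay slots.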
On the DSP, a stall is inserted if a branch is taken to an execute packet that spans
fetch packets, to give time to fetch the second packet. Normally the assembler
compensates for this by preventing branch targets from spanning fetch packets. The one
case in which this cannot be done is when an interrupt or exception occurs and the
return target is an execute packet that spans fetch packets.
Table 5-11 Branch Instruction Execution
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read src2
Written
Branch taken ✓
Unit in use .S2
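Figure: pipeline phases of the branch (PG PS PW PR DP DC E1) and of the branch target (PG PS PW PR DP DC E1), with 5 delay slots between the E1 phase of the branch and the E1 phase of the target.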
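Figure: branch instruction execution block diagram. A 256-bit fetch feeds eight 32-bit decode paths. DP holds SMPYH, SMPY, SADD, SADD, B, and MVK; DC holds LDW and LDW; E1 holds MVK, SMPY, SMPYH, and B. Functional units: .L1 .S1 .M1 .D1 .D2 .M2 .S2 .L2.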
The lower and upper 32 bits of the DP source are read on E1 using the src1 and src2
ports, respectively. The lower 32 bits of the DP result are written on E1 and the upper
32 bits are written on E2. The two-cycle DP instructions are executed
on the .S units. The status is written to the FAUCR on E1. Figure 5-20 shows the fetch,
decode, and execute phases of the pipeline that the two-cycle DP instructions use.
Table 5-12 Two-Cycle DP Instruction Execution
Pipeline Stage   E1               E2
Read             src2_l, src2_h
Written          dst_l            dst_h
Unit in use      .S
Figure: two-cycle DP pipeline phases PG PS PW PR DP DC E1 E2 (1 delay slot).
PG PS PW PR DP DC E1 E2 E3 2 delay slots
The sources are read on E1 and the results are written on E4. The four-cycle
instructions are executed on the .L or .M units. The status is written to the
floating-point multiplier configuration register (FMCR) or the floating-point adder
configuration register (FADCR) on E4. Figure 5-22 shows the fetch, decode, and
execute phases of the pipeline that the four-cycle instructions use.
Table 5-14 Four-Cycle Instruction Execution
Pipeline Stage   E1           E2   E3   E4
Read             src1, src2
Written                                 dst
Unit in use      .L or .M
Figure: four-cycle instruction pipeline phases PG PS PW PR DP DC E1 E2 E3 E4 (3 delay slots).
Figure: pipeline phases PG PS PW PR DP DC E1 E2 E3 E4 E5 (4 delay slots).
The DP compare instructions are executed on the .S unit. The functional unit latency
for DP compare instructions is 2. The status is written to the floating-point auxiliary
configuration register (FAUCR) on E2. Figure 5-24 shows the fetch, decode, and execute
phases of the pipeline that the DP compare instruction uses.
Table 5-16 DP Compare Instruction Execution
Pipeline Stage   E1               E2
Read             src1_l, src2_l   src1_h, src2_h
Written                           dst
Unit in use      .S               .S
Figure: DP compare pipeline phases PG PS PW PR DP DC E1 E2 (1 delay slot).
Figure: pipeline phases PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 (6 delay slots).
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9
8 delay slots
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
9 delay slots
The MPYDP instruction is executed on the .M unit. The functional unit latency for the MPYDP
instruction is 4. The status is written to the floating-point multiplier configuration
register (FMCR) on E9. Figure 5-28 shows the fetch, decode, and execute phases of the
pipeline that the MPYDP instruction uses.
Table 5-20 MPYDP Instruction Execution
Pipeline Stage   E1       E2       E3       E4       E5   E6   E7   E8   E9      E10
Read             src1_l,  src1_l,  src1_h,  src1_h,
                 src2_l   src2_h   src2_l   src2_h
Written                                                                  dst_l   dst_h
Unit in use      .M       .M       .M       .M
Figure: MPYDP pipeline phases PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 (9 delay slots).
Figure: pipeline phases PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 (6 delay slots).
PG PS PW PR DP DC E1 E2 E3 E4 E5
4 delay slots
5.3 Functional Unit Constraints
The following sections provide information about what happens during each execute
phase of the instructions within a category for each of the functional units.
Table 5-24 shows the instruction constraints for DP compare instructions executing on
the .S unit.
Table 5-24 DP Compare .S-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3
DP compare R RW
Instruction Type Subsequent Same-Unit Instruction Executable
Single-cycle Xrw ✓
DP compare Xr ✓
2-cycle DP Xrw ✓
ADDDP/SUBDP Xr ✓
ADDSP/SUBSP Xr ✓
Branch1 Xr ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle Xr ✓
Load Xr ✓
Store Xr ✓
INTDP Xr ✓
ADDDP/SUBDP Xr ✓
16 × 16 multiply Xr ✓
4-cycle Xr ✓
MPYI Xr ✓
MPYID Xr ✓
MPYDP Xr ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xr = Next instruction cannot enter E1 during cycle-read/decode
constraint; Xrw = Next instruction cannot enter E1 during cycle-read/decode/write constraint
1. The branch on register instruction is the only branch instruction that reads a general-purpose register
Table 5-25 shows the instruction constraints for 2-cycle DP instructions executing on
the .S unit.
Table 5-25 2-Cycle DP .S-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3
2-cycle RW W
Instruction Type Subsequent Same-Unit Instruction Executable
Single-cycle Xw ✓
DP compare ✓ ✓
2-cycle DP Xw ✓
Branch ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single cycle ✓ ✓
Load ✓ ✓
Store ✓ ✓
INTDP ✓ ✓
ADDDP/SUBDP ✓ ✓
16 × 16 multiply ✓ ✓
4-cycle ✓ ✓
MPYI ✓ ✓
MPYID ✓ ✓
MPYDP ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xw = Next instruction cannot enter E1 during cycle-write
constraint
Table 5-28 shows the instruction constraints for branch instructions executing on the
.S unit.
Table 5-28 Branch .S-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5 6 7 8
Branch1 R
Instruction Type Subsequent Same-Unit Instruction Executable
Single-cycle ✓ ✓ ✓ ✓ ✓ ✓ ✓
DP compare ✓ ✓ ✓ ✓ ✓ ✓ ✓
2-cycle DP ✓ ✓ ✓ ✓ ✓ ✓ ✓
Branch ✓ ✓ ✓ ✓ ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle ✓ ✓ ✓ ✓ ✓ ✓ ✓
Load ✓ ✓ ✓ ✓ ✓ ✓ ✓
Store ✓ ✓ ✓ ✓ ✓ ✓ ✓
INTDP ✓ ✓ ✓ ✓ ✓ ✓ ✓
ADDDP/SUBDP ✓ ✓ ✓ ✓ ✓ ✓ ✓
16 × 16 multiply ✓ ✓ ✓ ✓ ✓ ✓ ✓
4-cycle ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYI ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYID ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYDP ✓ ✓ ✓ ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; ✓ = Next
instruction can enter E1 during cycle
1. The branch on register instruction is the only branch instruction that reads a general-purpose register
Table 5-30 shows the instruction constraints for 4-cycle instructions executing on the
.M unit.
Table 5-30 4-Cycle .M-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5
4-cycle R W
Instruction Type Subsequent Same-Unit Instruction Executable
16 × 16 multiply ✓ Xw ✓ ✓
4-cycle ✓ ✓ ✓ ✓
MPYI ✓ ✓ ✓ ✓
MPYID ✓ ✓ ✓ ✓
MPYDP ✓ ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle ✓ ✓ ✓ ✓
Load ✓ ✓ ✓ ✓
Store ✓ ✓ ✓ ✓
DP compare ✓ ✓ ✓ ✓
2-cycle DP ✓ ✓ ✓ ✓
Branch ✓ ✓ ✓ ✓
4-cycle ✓ ✓ ✓ ✓
INTDP ✓ ✓ ✓ ✓
ADDDP/SUBDP ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xw = Next instruction cannot enter E1 during cycle-write
constraint
Table 5-31 shows the instruction constraints for MPYI instructions executing on the
.M unit.
Table 5-31 MPYI .M-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5 6 7 8 9 10
MPYI R R R R W
Instruction Type Subsequent Same-Unit Instruction Executable
16 × 16 multiply Xr Xr Xr ✓ ✓ ✓ Xw ✓ ✓
4-cycle Xr Xr Xr Xu Xw Xu ✓ ✓ ✓
MPYI Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
MPYID Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
MPYDP Xr Xr Xr Xu Xu Xu ✓ ✓ ✓
MPYSPDP Xr Xr Xr Xu Xu Xu ✓ ✓ ✓
MPYSP2DP Xr Xr Xr Xw Xw Xu ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
Load ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Store ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
DP compare Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
2-cycle DP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
Branch Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
4-cycle Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
INTDP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
ADDDP/SUBDP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xr = Next instruction cannot enter E1 during cycle-read/decode
constraint; Xw = Next instruction cannot enter E1 during cycle-write constraint; Xu = Next instruction cannot
enter E1 during cycle-other resource conflict
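One way to read a constraint table is to encode it and query it by cycle. The sketch below transcribes the same-unit rows of Table 5-31 for a few instruction types. It assumes that the nine entries in each row correspond to cycles 2 through 10 (cycle 1 being the E1 phase of the MPYI itself) and that no constraint remains after the last listed cycle; the names `MPYI_SAME_UNIT`, `constraint`, and `earliest_e1_cycle` are hypothetical.

```python
OK, XR, XW, XU = "ok", "Xr", "Xw", "Xu"

# Assumption: the first entry of each row applies to cycle 2.
FIRST_LISTED_CYCLE = 2

# Same-unit rows transcribed from Table 5-31 (MPYI on the .M unit).
MPYI_SAME_UNIT = {
    "16 x 16 multiply": [XR, XR, XR, OK, OK, OK, XW, OK, OK],
    "4-cycle":          [XR, XR, XR, XU, XW, XU, OK, OK, OK],
    "MPYI":             [XR, XR, XR, OK, OK, OK, OK, OK, OK],
    "MPYID":            [XR, XR, XR, OK, OK, OK, OK, OK, OK],
    "MPYDP":            [XR, XR, XR, XU, XU, XU, OK, OK, OK],
}


def constraint(next_type, cycle):
    """Table entry for next_type entering E1 in the given cycle."""
    index = cycle - FIRST_LISTED_CYCLE
    if index < 0:
        raise ValueError("cycle 1 is the E1 phase of the MPYI itself")
    row = MPYI_SAME_UNIT[next_type]
    # Assumption: cycles past the end of the table are unconstrained.
    return row[index] if index < len(row) else OK


def earliest_e1_cycle(next_type):
    """First cycle in which next_type may enter E1 on the same unit."""
    cycle = FIRST_LISTED_CYCLE
    while constraint(next_type, cycle) != OK:
        cycle += 1
    return cycle
```

Note that the earliest legal cycle is not the whole story: a 16 x 16 multiply may enter E1 in cycle 5, but cycle 8 is blocked again by a write constraint, so a scheduler would need to check the specific cycle rather than only the first legal one.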
Table 5-32 shows the instruction constraints for MPYID instructions executing on the
.M unit.
Table 5-32 MPYID .M-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5 6 7 8 9 10 11
MPYID R R R R W W
Instruction Type Subsequent Same-Unit Instruction Executable
16 × 16 multiply Xr Xr Xr ✓ ✓ ✓ Xw Xw ✓ ✓
4-cycle Xr Xr Xr Xu Xw Xw ✓ ✓ ✓ ✓
MPYI Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYID Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYDP Xr Xr Xr Xu Xu Xu ✓ ✓ ✓ ✓
MPYSPDP Xr Xr Xr Xw Xu Xu ✓ ✓ ✓ ✓
MPYSP2DP Xr Xr Xr Xw Xw Xw ✓ ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
Load ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Store ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
DP compare Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
2-cycle DP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
Branch Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
4-cycle Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
INTDP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
ADDDP/SUBDP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xr = Next instruction cannot enter E1 during cycle-read/decode
constraint; Xw = Next instruction cannot enter E1 during cycle-write constraint; Xu = Next instruction cannot
enter E1 during cycle-other resource conflict
Table 5-33 shows the instruction constraints for MPYDP instructions executing on the
.M unit.
Table 5-33 MPYDP .M-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5 6 7 8 9 10 11
MPYDP R R R R W W
Instruction Type Subsequent Same-Unit Instruction Executable
16 × 16 multiply Xr Xr Xr ✓ ✓ ✓ Xw Xw ✓ ✓
4-cycle Xr Xr Xr Xu Xw Xw ✓ ✓ ✓ ✓
MPYI Xr Xr Xr Xu Xu Xu ✓ ✓ ✓ ✓
MPYID Xr Xr Xr Xu Xu Xu ✓ ✓ ✓ ✓
MPYDP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYSPDP Xr Xr Xr Xw Xu Xu ✓ ✓ ✓ ✓
MPYSP2DP Xr Xr Xr Xw Xw Xw ✓ ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
Load ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Store ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
DP compare Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
2-cycle DP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
Branch Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
4-cycle Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
INTDP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
ADDDP/SUBDP Xr Xr Xr ✓ ✓ ✓ ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xr = Next instruction cannot enter E1 during cycle-read/decode
constraint; Xw = Next instruction cannot enter E1 during cycle-write constraint; Xu = Next instruction cannot
enter E1 during cycle-other resource conflict
Table 5-34 shows the instruction constraints for MPYSP instructions executing on the
.M unit.
Table 5-34 MPYSP .M-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4
MPYSP R W
Instruction Type Subsequent Same-Unit Instruction Executable
MPYSPDP ✓ ✓ ✓
MPYSP2DP ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle ✓ ✓ ✓
Load ✓ ✓ ✓
Store ✓ ✓ ✓
DP compare ✓ ✓ ✓
2-cycle DP ✓ ✓ ✓
Branch ✓ ✓ ✓
4-cycle ✓ ✓ ✓
INTDP ✓ ✓ ✓
ADDDP/SUBDP ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle
Table 5-35 shows the instruction constraints for MPYSPDP instructions executing on
the .M unit.
Table 5-35 MPYSPDP .M-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5 6 7
MPYSPDP R R W W
Instruction Type Subsequent Same-Unit Instruction Executable
16 × 16 multiply Xr ✓ ✓ Xw Xw ✓
MPYDP Xr Xu Xu ✓ ✓ ✓
MPYI Xr Xu Xu ✓ ✓ ✓
MPYID Xr Xu Xu ✓ ✓ ✓
MPYSP Xr Xw Xw ✓ ✓ ✓
MPYSPDP Xr Xu ✓ ✓ ✓ ✓
MPYSP2DP Xr Xw Xw ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle Xr ✓ ✓ ✓ ✓ ✓
Load Xr ✓ ✓ ✓ ✓ ✓
Store Xr ✓ ✓ ✓ ✓ ✓
DP compare Xr ✓ ✓ ✓ ✓ ✓
2-cycle DP Xr ✓ ✓ ✓ ✓ ✓
Branch Xr ✓ ✓ ✓ ✓ ✓
4-cycle Xr ✓ ✓ ✓ ✓ ✓
INTDP Xr ✓ ✓ ✓ ✓ ✓
ADDDP/SUBDP Xr ✓ ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xr = Next instruction cannot enter E1 during cycle-read/decode
constraint; Xw = Next instruction cannot enter E1 during cycle-write constraint; Xu = Next instruction cannot
enter E1 during cycle-other resource conflict
Table 5-36 shows the instruction constraints for MPYSP2DP instructions executing
on the .M unit.
Table 5-36 MPYSP2DP .M-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5
MPYSP2DP R R W W
Instruction Type Subsequent Same-Unit Instruction Executable
16 × 16 multiply ✓ Xw Xw ✓
MPYDP Xu ✓ ✓ ✓
MPYI Xu ✓ ✓ ✓
MPYID Xu ✓ ✓ ✓
MPYSP Xw ✓ ✓ ✓
MPYSPDP Xu ✓ ✓ ✓
MPYSP2DP Xw ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle Xr ✓ ✓ ✓
Load Xr ✓ ✓ ✓
Store Xr ✓ ✓ ✓
DP compare Xr ✓ ✓ ✓
2-cycle DP Xr ✓ ✓ ✓
Branch Xr ✓ ✓ ✓
4-cycle Xr ✓ ✓ ✓
INTDP Xr ✓ ✓ ✓
ADDDP/SUBDP Xr ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xr = Next instruction cannot enter E1 during cycle-read/decode
constraint; Xw = Next instruction cannot enter E1 during cycle-write constraint; Xu = Next instruction cannot
enter E1 during cycle-other resource conflict
Table 5-38 shows the instruction constraints for 4-cycle instructions executing on the
.L unit.
Table 5-38 4-Cycle .L-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5
4-cycle R W
Instruction Type Subsequent Same-Unit Instruction Executable
Single-cycle ✓ ✓ Xw ✓
4-cycle ✓ ✓ ✓ ✓
INTDP ✓ ✓ ✓ ✓
ADDDP/SUBDP ✓ ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle ✓ ✓ ✓ ✓
DP compare ✓ ✓ ✓ ✓
2-cycle DP ✓ ✓ ✓ ✓
4-cycle ✓ ✓ ✓ ✓
Load ✓ ✓ ✓ ✓
Store ✓ ✓ ✓ ✓
Branch ✓ ✓ ✓ ✓
16 × 16 multiply ✓ ✓ ✓ ✓
MPYI ✓ ✓ ✓ ✓
MPYID ✓ ✓ ✓ ✓
MPYDP ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xw = Next instruction cannot enter E1 during cycle-write
constraint
Table 5-39 shows the instruction constraints for INTDP instructions executing on the
.L unit.
Table 5-39 INTDP .L-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5 6
INTDP R W W
Instruction Type Subsequent Same-Unit Instruction Executable
Single-cycle ✓ ✓ Xw Xw ✓
4-cycle Xw ✓ ✓ ✓ ✓
INTDP Xw ✓ ✓ ✓ ✓
ADDDP/SUBDP ✓ ✓ ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
Single-cycle ✓ ✓ ✓ ✓ ✓
DP compare ✓ ✓ ✓ ✓ ✓
2-cycle DP ✓ ✓ ✓ ✓ ✓
4-cycle ✓ ✓ ✓ ✓ ✓
Load ✓ ✓ ✓ ✓ ✓
Store ✓ ✓ ✓ ✓ ✓
Branch ✓ ✓ ✓ ✓ ✓
16 × 16 multiply ✓ ✓ ✓ ✓ ✓
MPYI ✓ ✓ ✓ ✓ ✓
MPYID ✓ ✓ ✓ ✓ ✓
MPYDP ✓ ✓ ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xw = Next instruction cannot enter E1 during cycle-write
constraint
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xr = Next instruction cannot enter E1 during cycle-read/decode
constraint; Xw = Next instruction cannot enter E1 during cycle-write constraint; Xrw = Next instruction cannot
enter E1 during cycle-read/decode/write constraint
Table 5-42 shows the instruction constraints for store instructions executing on the .D
unit.
Table 5-42 Store .D-Unit Instruction Constraints
Instruction Execution
Cycle 1 2 3 4
Store RW
Instruction Type Subsequent Same-Unit Instruction Executable
Single-cycle ✓ ✓ ✓
Load ✓ ✓ ✓
Store ✓ ✓ ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
16 × 16 multiply ✓ ✓ ✓
MPYI ✓ ✓ ✓
MPYID ✓ ✓ ✓
MPYDP ✓ ✓ ✓
Single-cycle ✓ ✓ ✓
DP compare ✓ ✓ ✓
2-cycle DP ✓ ✓ ✓
Branch ✓ ✓ ✓
4-cycle ✓ ✓ ✓
INTDP ✓ ✓ ✓
ADDDP/SUBDP ✓ ✓ ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle
Table 5-43 shows the instruction constraints for single-cycle instructions executing on
the .D unit.
Table 5-43 Single-Cycle .D-Unit Instruction Constraints
Instruction Execution
Cycle 1 2
Single-cycle RW
Instruction Type Subsequent Same-Unit Instruction Executable
Single-cycle ✓
Load ✓
Store ✓
Instruction Type Same Side, Different Unit, Both Using Cross Path Executable
16 × 16 multiply ✓
MPYI ✓
MPYID ✓
MPYDP ✓
Single-cycle ✓
DP compare ✓
2-cycle DP ✓
Branch ✓
4-cycle ✓
INTDP ✓
ADDDP/SUBDP ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle
Table 5-44 shows the instruction constraints for LDDW instructions executing on the
.D unit.
Table 5-44 LDDW Instruction With Long Write Instruction Constraints
Instruction Execution
Cycle 1 2 3 4 5 6
LDDW RW W
Instruction Type Subsequent Same-Unit Instruction Executable
Instruction with long result ✓ ✓ ✓ Xw ✓
LEGEND: Shaded text = E1 phase of the single-cycle instruction; R = Sources read for the instruction; W =
Destinations written for the instruction;
✓ = Next instruction can enter E1 during cycle; Xw = Next instruction cannot enter E1 during cycle-write
constraint
5.4 Performance Considerations
A fetch packet (FP) is a grouping of eight instructions. Each FP can be split into
one to eight execute packets (EPs). Each EP contains instructions that execute in
parallel, and each instruction executes in an independent functional unit. This section
considers the effect on the pipeline of combinations of EPs that include varying numbers
of parallel instructions, or just a single instruction that executes serially with other code.
In general, the number of execute packets in a single FP defines the flow of instructions
through the pipeline. Another defining factor is the instruction types in the EP. Each
type of instruction has a fixed number of execute cycles that determines when this
instruction's operations are complete. Section 5.4.2 ‘‘Multicycle NOPs’’ on page 5-44
covers the effect of including a multicycle NOP in an individual EP.
Finally, the effect of the memory system on the operation of the pipeline is considered.
The access of program and data memory is discussed, along with memory stalls.
Figure 5-31 Pipeline Operation: Fetch Packets With Different Numbers of Execute Packets
Fetch packet (FP) and execute packet (EP) by clock cycle; -- = pipeline stall
FP    EP    1   2   3   4   5   6   7   8   9   10  11  12  13
n     k     PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
n     k+1                       DP  DC  E1  E2  E3  E4  E5
n     k+2                           DP  DC  E1  E2  E3  E4  E5
n+1   k+3       PG  PS  PW  PR  --  --  DP  DC  E1  E2  E3  E4
n+2   k+4           PG  PS  PW  --  --  PR  DP  DC  E1  E2  E3
n+3   k+5               PG  PS  --  --  PW  PR  DP  DC  E1  E2
n+4   k+6                   PG  --  --  PS  PW  PR  DP  DC  E1
n+5   k+7                               PG  PS  PW  PR  DP  DC
n+6   k+8                                   PG  PS  PW  PR  DP
In Figure 5-31, fetch packet n, which contains three execute packets, is shown followed
by six fetch packets (n + 1 through n + 6), each with one execute packet (containing
eight parallel instructions). The first fetch packet (n) goes through the program fetch
phases during cycles 1-4. During these cycles, a program fetch phase is started for each
of the fetch packets that follow.
In cycle 5, the program dispatch (DP) phase, the CPU scans the p-bits and detects that
there are three execute packets (k through k + 2) in fetch packet n. This forces the
pipeline to stall, which allows the DP phase to start for execute packets k + 1 and k + 2
in cycles 6 and 7. Once execute packet k + 2 is ready to move on to the DC phase (cycle
8), the pipeline stall is released.
Fetch packets n + 1 through n + 4 were all stalled so that the CPU had time to
perform the DP phase for each of the three execute packets (k through k + 2) in fetch
packet n. Fetch packet n + 5 was also stalled in cycles 6 and 7: it was not allowed to enter
the PG phase until after the pipeline stall was released in cycle 8. The pipeline continues
operation as shown with fetch packets n + 5 and n + 6 until another fetch packet
containing multiple execute packets enters the DP phase, or an interrupt occurs.
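The timing in Figure 5-31 can be approximated with a short model. This is a sketch under stated assumptions: consecutive fetch packets, one execute packet dispatched per cycle, the first DP phase in cycle 5 as in the figure, and no memory stalls, branches, multicycle NOPs, or interrupts. The function names are hypothetical.

```python
def dispatch_schedule(eps_per_fetch_packet, first_dp_cycle=5):
    """Return the DP, DC, and E1 cycle of every execute packet.

    eps_per_fetch_packet lists how many execute packets each consecutive
    fetch packet contains (1 to 8). One execute packet is dispatched per cycle.
    """
    schedule = []
    cycle = first_dp_cycle
    for fp_index, ep_count in enumerate(eps_per_fetch_packet):
        if not 1 <= ep_count <= 8:
            raise ValueError("a fetch packet holds one to eight execute packets")
        for _ in range(ep_count):
            schedule.append({"fp": fp_index, "dp": cycle, "dc": cycle + 1, "e1": cycle + 2})
            cycle += 1
    return schedule


def stall_cycles(eps_per_fetch_packet):
    """Cycles the fetch phases are held while extra execute packets dispatch."""
    return sum(ep_count - 1 for ep_count in eps_per_fetch_packet)


# Fetch packet n has three execute packets; n + 1 through n + 6 have one each.
figure_5_31 = dispatch_schedule([3, 1, 1, 1, 1, 1, 1])
```

Under these assumptions the model reproduces the figure: execute packets k through k + 6 reach E1 in cycles 7 through 13, and the fetch phases are held for two cycles.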
Figure 5-32 shows how a multicycle NOP drives the execution of other instructions in
the same execute packet. Figure 5-32(a) shows a NOP in an execute packet (in parallel)
with other code. The results of the LD, ADD, and MPY are available during the proper
cycle for each instruction. Hence, the NOP has no effect on the execute packet.
Figure 5-32(b) shows the replacement of the single-cycle NOP with a multicycle NOP
(NOP 5) in the same execute packet. The NOP 5 causes no operations to be performed
other than those from the instructions inside its execute packet. The results of the
LD, ADD, and MPY cannot be used by any other instructions until the NOP 5 period
has completed.
Figure 5-32 Multicycle NOP in an Execute Packet
(a) Execute packet LD ADD MPY NOP issued in cycle i; the figure marks the following cycles through i+4.
(b) Execute packet LD ADD MPY NOP 5 issued in cycle i; the multicycle NOP spans cycles i through i+4.
Figure 5-33 shows how a multicycle NOP can be affected by a branch. If the delay
slots of a branch finish while a multicycle NOP is still dispatching NOPs into the
pipeline, the branch overrides the multicycle NOP and the branch target begins
execution five delay slots after the branch was issued.
Figure 5-33 Branching and Multicycle NOPs
Cycle #  Execute packet                  Pipeline phase of branch target
1        EP1: B ... (branch in E1)       PG
2        EP2: EP without branch (A)      PS
3        EP3: EP without branch (A)      PW
4        EP4: EP without branch (A)      PR
5        EP5: EP without branch (A)      DP
6        EP6: LD MPY ADD NOP5 (A)        DC
7        Branch will execute here        E1
(A) Delay slots of the branch (cycles 2 through 6).
The figure also labels EP7 and cycle 10 below the branch row.
In one case, execute packet 1 (EP1) does not have a branch. The NOP 5 in EP6 forces
the CPU to wait until cycle 11 to execute EP7.
In the other case, EP1 does have a branch. The delay slots of the branch coincide with
cycles 2 through 6. Once the target code reaches E1 in cycle 7, it executes.
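The two cases can be compared with a small calculation. The sketch assumes no stalls and uses the five branch delay slots stated earlier in this chapter; the function name and parameters are illustrative.

```python
BRANCH_DELAY_SLOTS = 5


def next_packet_e1_cycle(nop_cycle, nop_count, branch_e1_cycle=None):
    """Cycle in which the next execute packet (EP7 or a branch target) enters E1.

    nop_cycle       -- cycle in which the packet containing NOP <nop_count> executes
    nop_count       -- operand of the multicycle NOP
    branch_e1_cycle -- cycle in which an earlier branch was in E1, or None
    """
    # Without a branch, the multicycle NOP occupies nop_count cycles.
    after_nop = nop_cycle + nop_count
    if branch_e1_cycle is None:
        return after_nop
    # A pending branch target overrides whatever remains of the multicycle NOP.
    target = branch_e1_cycle + BRANCH_DELAY_SLOTS + 1
    return min(after_nop, target)


without_branch = next_packet_e1_cycle(nop_cycle=6, nop_count=5)
with_branch = next_packet_e1_cycle(nop_cycle=6, nop_count=5, branch_e1_cycle=1)
```

For the values in Figure 5-33, the NOP 5 in cycle 6 holds EP7 until cycle 11 when there is no branch, while a branch in EP1 lets the target execute in cycle 7.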
Depending on the type of memory and the time required to complete an access, the
pipeline may stall to ensure proper coordination of data and instructions.
A memory stall occurs when memory is not ready to respond to an access from the
CPU. This access occurs during the PW phase for a program memory access and
during the E3 phase for a data memory access. The memory stall causes all of the
pipeline phases to lengthen beyond a single clock cycle, causing execution to take
additional clock cycles to finish. The results of the program execution are identical
whether a stall occurs or not. Figure 5-35 illustrates this point.
Figure 5-35 Program and Data Memory Stalls
Clock Cycle
Fetch Packet
(FP) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
n PG PS PW PR DP DC E1 E2 E3 E4 E5
n+1 PG PS PW PR DP DC E1 E2 E3 E4
n+2 PG PS PW PR DP Program DC E1 E2 E3
n+3 PG PS PW PR memory stall DP DC Data E1 E2
n+4 PG PS PW PR DP memory stall DC E1
n+5 PG PS PW PR DP DC
n+6 PG PS PW PR DP
n+7 PG PS PW PR
n+8 PG PS PW
n+9 PG PS
n+10 PG
Chapter 6
Interrupts
This chapter describes CPU interrupts, including reset and the nonmaskable interrupt
(NMI). It details the related CPU control registers and their functions in controlling
interrupts. It also describes interrupt processing, the method the CPU uses to
automatically detect the presence of interrupts and divert program execution flow to your
interrupt service code. Finally, this chapter describes the programming implications of
interrupts.
6.1 Overview
Typically, DSPs work in an environment that contains multiple external asynchronous
events. These events require tasks to be performed by the DSP when they occur. An
interrupt is an event that stops the current process in the CPU so that the CPU can
attend to the task needing completion because of the event. These interrupt sources can
be on chip or off chip, such as timers, analog-to-digital converters, or other peripherals.
Servicing an interrupt involves saving the context of the current process, completing
the interrupt task, restoring the registers and the process context, and resuming the
original process. There are eight registers that control servicing interrupts.
An appropriate transition on an interrupt pin sets the pending status of the interrupt
within the interrupt flag register (IFR). If the interrupt is properly enabled, the CPU
begins processing the interrupt and redirecting program flow to the interrupt service
routine.
The reset, nonmaskable, and maskable interrupt types are differentiated by their priorities, as shown in Table 6-1. The
reset interrupt has the highest priority and corresponds to the RESET signal. The
nonmaskable interrupt (NMI) has the second highest priority and corresponds to the
NMI signal. The lowest priority interrupts are interrupts 4-15 corresponding to the
INT4-INT15 signals. RESET, NMI, and some of the INT4-INT15 signals are mapped
to pins on C6000 devices. Some of the INT4-INT15 interrupt signals are used by
internal peripherals and some may be unavailable or can be used under software
control. Check your device-specific data sheet to see your interrupt specifications.
The CPU supports exceptions as another type of interrupt. When exceptions are
enabled, the NMI input behaves as an exception. This chapter does not deal in depth
with exceptions, as it assumes for discussion of NMI as an interrupt that they are
disabled. Chapter 7 ‘‘CPU Exceptions’’ on page 7-1 discusses exceptions including
NMI behavior as an exception.
NMI is the second-highest priority interrupt and is generally used to alert the CPU of
a serious hardware problem such as imminent power failure.
For NMI processing to occur, the nonmaskable interrupt enable (NMIE) bit in the
interrupt enable register (IER) must be set to 1. If NMIE is set to 1, the only condition
that can prevent NMI processing is if the NMI occurs during the delay slots of a branch
(whether the branch is taken or not).
On the CPU, if an NMI is recognized within an SPLOOP operation, the behavior is the
same as for an NMI with exceptions enabled. The SPLOOP operation terminates
immediately (loop does not wind down as it does in case of an interrupt). The SPLX bit
in the NMI/exception task state register (NTSR) is set for status purposes. The NMI
service routine must look at this as one of the factors on whether a return to the
interrupted code is possible. If the SPLX bit in NTSR is set, then a return to the
interrupted code results in incorrect operation. See Section 8.13 on page 8-36 for
more information.
Assuming that a maskable interrupt does not occur during the delay slots of a branch
(this includes conditional branches that do not complete execution due to a false
condition), the following conditions must be met to process a maskable interrupt:
• The global interrupt enable (GIE) bit in the control status register (CSR) is set
to 1.
• The NMIE bit in the interrupt enable register (IER) is set to 1.
• The corresponding interrupt enable (IE) bit in the IER is set to 1.
• The corresponding interrupt occurs, which sets the corresponding bit in the
interrupt flag register (IFR) to 1, and there are no higher priority interrupt flag
(IF) bits set in the IFR.
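The conditions above can be collected into a single predicate. The following Python sketch is a behavioral model only, not device code. The function name is hypothetical, the NMIE position (bit 1 of IER) is taken from the register descriptions elsewhere in this guide, and two points are assumptions: that a lower interrupt number means a higher priority, and that a higher priority flag blocks INTn only when that interrupt is also enabled.

```python
NMIE = 1 << 1  # assumption: NMIE occupies bit 1 of IER


def maskable_interrupt_taken(n, gie, ier, ifr, in_branch_delay_slots=False):
    """Return True if maskable interrupt INTn would be processed (model only)."""
    if not 4 <= n <= 15:
        raise ValueError("maskable interrupts are INT4-INT15")
    if in_branch_delay_slots:
        # Interrupts are not taken during the delay slots of a branch.
        return False
    bit = 1 << n
    # Bits 1..n-1: every interrupt with a higher priority than INTn.
    higher_priority = ((1 << n) - 1) & ~1
    return (
        gie == 1                                # GIE bit in CSR is 1
        and bool(ier & NMIE)                    # NMIE bit in IER is 1
        and bool(ier & bit)                     # IEn bit in IER is 1
        and bool(ifr & bit)                     # IFn bit in IFR is 1
        and not (ifr & ier & higher_priority)   # nothing more urgent pending
    )
```

For example, with INT4 and INT6 both enabled and both pending, the model reports that INT4 is taken first and INT6 waits.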
The addresses and contents of the IST are shown in Figure 6-1. Because each fetch
packet contains eight 32-bit instruction words (or 32 bytes), each address in the table is
incremented by 32 bytes (20h) from the one adjacent to it.
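The address arithmetic is simple enough to check by hand. As a small illustration (a Python sketch with hypothetical names, assuming a byte-addressed table that starts at a base address you supply), the offset of each interrupt service fetch packet is the interrupt number times 32:

```python
WORDS_PER_FETCH_PACKET = 8
BYTES_PER_WORD = 4
ISFP_SIZE = WORDS_PER_FETCH_PACKET * BYTES_PER_WORD  # 32 bytes, or 20h


def isfp_address(ist_base, n):
    """Byte address of the fetch packet for interrupt n (0 = RESET, 1 = NMI)."""
    if not 0 <= n <= 15:
        raise ValueError("the IST holds entries 0 through 15")
    return ist_base + n * ISFP_SIZE


if __name__ == "__main__":
    for n in (0, 1, 4, 15):
        print(f"entry {n:2d}: offset {isfp_address(0, n):03X}h")
```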
Figure 6-1 Interrupt Service Table
Program memory
Note—The ISFP should be exactly 8 words long. To prevent the compiler from
using compact instructions (see section 3.10 ‘‘Compact Instructions on the
CPU’’ on page 3-29), the interrupt service table should be preceded by a
.nocmp directive. See the TMS320C6000 Assembly Language Tools User’s
Guide (SPRU186).
If the NOP 5 was not in the routine, the CPU would execute the next five
execute packets (some of which are likely to be associated with the next ISFP)
because of the delay slots associated with the B IRP instruction. See section
5.2.6 ‘‘Branch Instructions’’ for more information.
If the interrupt service routine for an interrupt is too large to fit in a single fetch packet,
a branch to the location of additional interrupt service routine code is required.
Figure 6-3 shows that the interrupt service routine for INT4 was too large for a single
fetch packet, and a branch to memory location 1234h is required to complete the
interrupt service routine.
Note—The instruction B LOOP branches into the middle of a fetch packet and
processes code starting at address 1234h. The CPU ignores code from address
1220h−1230h, even if it is in parallel to code at address 1234h.
Figure 6-3 Interrupt Service Table With Branch to Additional Interrupt Service Code Located Outside the IST
Because the HPEINT field in ISTP gives the value of the highest priority interrupt that
is both pending and enabled, the whole of ISTP gives the address of the ISFP for the
highest priority interrupt that is both pending and enabled.
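To make the relationship concrete, the Python sketch below packs and unpacks the two ISTP fields. It is illustrative only: the function names are hypothetical, and the field positions (ISTB in bits 31-10, HPEINT in bits 9-5, bits 4-0 zero) are assumed from the ISTP register description elsewhere in this guide.

```python
ISTB_MASK = 0xFFFFFC00               # assumed: ISTB occupies bits 31-10
HPEINT_SHIFT = 5                     # assumed: HPEINT occupies bits 9-5
HPEINT_MASK = 0x1F << HPEINT_SHIFT


def make_istp(ist_base, hpeint):
    """Combine the table base and the interrupt number into an ISTP value."""
    return (ist_base & ISTB_MASK) | ((hpeint << HPEINT_SHIFT) & HPEINT_MASK)


def split_istp(istp):
    """Return (ISTB address, HPEINT) for an ISTP value."""
    return istp & ISTB_MASK, (istp & HPEINT_MASK) >> HPEINT_SHIFT
```

Because HPEINT sits five bits up, it is already multiplied by 32, which is why the register can be used directly as the address of the fetch packet.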
6.2 Globally Enabling and Disabling Interrupts
On the CPU, there is one physical GIE bit that is mapped to bit 0 of both CSR and TSR.
Similarly, there is one physical PGIE bit. It is mapped as CSR.PGIE (bit 1) and
ITSR.GIE (bit 0). Modification to either of these bits is reflected in both of the
mappings. In the following discussion, references to the GIE bit in CSR also refer to the
GIE bit in TSR, and references to the PGIE bit in CSR also refer to the GIE bit in ITSR.
The global interrupt enable (GIE) allows you to enable or disable all maskable
interrupts by controlling the value of a single bit. GIE is bit 0 of both the control status
register (CSR) and the task state register (TSR).
• GIE = 1 enables the maskable interrupts so that they are processed.
• GIE = 0 disables the maskable interrupts so that they are not processed.
The CPU detects interrupts in parallel with instruction execution. As a result, the CPU
may begin interrupt processing in the same cycle that an MVC instruction writes 0 to
GIE to disable interrupts. The PGIE bit (bit 1 of CSR) records the value of GIE after the
CPU begins interrupt processing, recording whether the program was in the process of
disabling interrupts.
During maskable interrupt processing, the CPU finishes executing the current execute
packet. The CPU then copies the current value of GIE to PGIE, overwriting the
previous value of PGIE. The CPU then clears GIE to prevent another maskable
interrupt from occurring before the handler saves the machine’s state. (Section 6.6.2
on page 6-27 discusses nesting interrupts.)
When the interrupt handler returns to the interrupted code with the B IRP instruction,
the CPU copies PGIE back to GIE. When the interrupted code resumes, GIE reflects the
last value written by the interrupted code.
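The interplay of GIE and PGIE described above can be traced with a small behavioral model. This Python sketch is not device code, and the class and method names are hypothetical. It models only the two bits and does not check whether an interrupt is allowed to be taken.

```python
class GlobalEnableModel:
    """Tracks only the GIE and PGIE bits."""

    def __init__(self, gie=1, pgie=0):
        self.gie = gie
        self.pgie = pgie

    def write_gie(self, value):
        # Models an MVC to CSR that changes only the GIE bit.
        self.gie = value & 1

    def take_interrupt(self):
        # The CPU copies GIE to PGIE, then clears GIE.
        self.pgie = self.gie
        self.gie = 0

    def b_irp(self):
        # Returning with B IRP copies PGIE back to GIE.
        self.gie = self.pgie


if __name__ == "__main__":
    cpu = GlobalEnableModel(gie=1)
    cpu.write_gie(0)      # program disables interrupts ...
    cpu.take_interrupt()  # ... but an interrupt detected in parallel is taken
    cpu.b_irp()
    print(cpu.gie)        # the interrupted code still sees its own write
```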
Because interrupt detection occurs in parallel with CPU execution, the CPU can take
an interrupt in the cycle immediately following an MVC instruction that clears GIE.
The behavior of PGIE and the B IRP instruction ensures, however, that interrupts do
not occur after subsequent execute packets. Consider the code in Example 6-2.
Example 6-2 Interrupts Versus Writes to GIE
;Assume GIE = 1
        MVC     CSR,B0          ; (1) Get CSR
        AND     -2,B0,B0        ; (2) Get ready to clear GIE
        MVC     B0,CSR          ; (3) Clear GIE
        ADD     A0,A1,A2        ; (4)
        ADD     A3,A4,A5        ; (5)
In Example 6-2, the CPU may service an interrupt between instructions 1 and 2,
between instructions 2 and 3, or between instructions 3 and 4. The CPU will not service
an interrupt between instructions 4 and 5.
On the CPU, programs must directly manipulate the GIE bit in CSR to disable and
enable interrupts. Example 6-3 and Example 6-4 show code examples for disabling and
enabling maskable interrupts globally, respectively.
The CPU copies TSR to ITSR, thereby saving the old value of GIE. It then clears
TSR.GIE. (ITSR.GIE is physically the same bit as CSR.PGIE and TSR.GIE is physically
the same bit as CSR.GIE.) When returning from an interrupt with the B IRP
instruction, the CPU restores the TSR state by copying ITSR back to TSR.
The CPU provides two new instructions that allow for simpler and safer manipulation
of the GIE bit.
• The DINT instruction disables interrupts by:
– Copying the value of CSR.GIE (and TSR.GIE) to TSR.SGIE
– Clearing CSR.GIE and TSR.GIE to 0 (disabling interrupts immediately)
The CPU will not service an interrupt between the execute packet containing
DINT and the execute packet that follows it.
• The RINT instruction restores interrupts to the previous state by:
– Copying the value of TSR.SGIE to CSR.GIE (and TSR.GIE)
– Clearing TSR.SGIE to 0
If the SGIE bit in TSR is 1 when RINT executes, interrupts are enabled immediately and the
CPU may service an interrupt in the cycle immediately following the execute packet
containing RINT.
Example 6-5 illustrates the use and timing of the DINT instruction in disabling
maskable interrupts globally and Example 6-6 shows how to enable maskable
interrupts globally using the complementary RINT instruction.
;Assume GIE = 1
        ADD     B0,1,B0         ; Interrupt possible between ADD and DINT
        DINT                    ; No interrupt between DINT and SUB
        SUB     B0,1,B0         ;
Note—The use of DINT and RINT instructions in a nested manner, like the
following code:
DINT
DINT
RINT
RINT
leaves interrupts disabled after the second RINT instruction. The successive
use of the DINT instruction leaves the TSR.SGIE bit cleared to 0, so the RINT
instructions copy zero to the GIE bit.
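The effect of nesting can be verified with a two-bit model of GIE and SGIE. The Python sketch below is a behavioral illustration with hypothetical names, following the DINT and RINT steps as described in this section.

```python
class TsrModel:
    """Tracks only the GIE and SGIE bits of TSR."""

    def __init__(self, gie=1, sgie=0):
        self.gie = gie
        self.sgie = sgie

    def dint(self):
        self.sgie = self.gie  # save the current enable state
        self.gie = 0          # disable interrupts

    def rint(self):
        self.gie = self.sgie  # restore the saved state
        self.sgie = 0         # the saved copy is consumed


def run(sequence, gie=1):
    """Apply a list of 'DINT'/'RINT' strings and return the final GIE."""
    tsr = TsrModel(gie=gie)
    for op in sequence:
        tsr.dint() if op == "DINT" else tsr.rint()
    return tsr.gie
```

A single DINT/RINT pair restores GIE to its earlier value. In the nested sequence, the second DINT saves a GIE of 0, so both RINT instructions copy 0 back.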
6.3 Individual Interrupt Control
Note—Any write to the ISR or ICR (by the MVC instruction) effectively has one
delay slot because the results cannot be read (by the MVC instruction) in IFR
until two cycles after the write to ISR or ICR.
A write to a bit in ICR is ignored if there is a simultaneous write to the same bit in ISR.
Example 6-10 and Example 6-11 show code examples to set and clear individual
interrupts.
Example 6-10 Code to Set an Individual Interrupt (INT6) and Read the Flag Register
MVK 40h,B3
MVC B3,ISR
NOP
MVC IFR,B4
Example 6-11 Code to Clear an Individual Interrupt (INT6) and Read the Flag Register
MVK 40h,B3
MVC B3,ICR
NOP
MVC IFR,B4
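The set and clear semantics, including the priority of ISR over ICR, can be summarized in a few lines. This Python sketch is a model with a hypothetical function name. It assumes that only the INT4-INT15 bits can be set or cleared through ISR and ICR, and it ignores the one-cycle delay before the result is visible in IFR.

```python
MASKABLE_BITS = 0xFFF0  # assumption: only INT4-INT15 are affected


def write_isr_icr(ifr, isr_write=0, icr_write=0):
    """Return the new IFR value after writes to ISR and/or ICR."""
    set_bits = isr_write & MASKABLE_BITS
    # A clear request loses against a simultaneous set request on the same bit.
    clear_bits = icr_write & MASKABLE_BITS & ~set_bits
    return (ifr | set_bits) & ~clear_bits


if __name__ == "__main__":
    INT6 = 0x40  # the value loaded by MVK 40h,B3 in the examples above
    print(hex(write_isr_icr(0x0000, isr_write=INT6)))  # IF6 set
    print(hex(write_isr_icr(INT6, icr_write=INT6)))    # IF6 cleared
```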
The program execution begins at the address specified by the ISTB field in ISTP.
The NTSR register will be copied back into the TSR register during the transfer of
control out of the interrupt.
The ITSR will be copied back into the TSR during the transfer of control out of the
interrupt.
6.4 Interrupt Detection and Processing
When an interrupt occurs, it sets a flag in the interrupt flag register (IFR). Depending
on certain conditions, the interrupt may or may not be processed. This section
discusses the mechanics of setting the flag bit, the conditions for processing an
interrupt, and the order of operation for detecting and processing an interrupt. The
similarities and differences between reset and nonreset interrupts are also discussed.
[Figure 6-4: pipeline timing diagram over CPU cycles 0-22. Signal traces: INTm at CPU boundary, IFm, EXC, TSR.GIE, TSR.XEN, TSR.INT, TSR.EXC, and the copy of TSR to ITSR. Pipeline rows: execute packets n through n+11 (n+2 contains no branch; annulled instructions are marked) and the ISFP. Cycles 6-14: nonreset interrupt processing is disabled (A).]
(A) After this point, interrupts are still disabled. All nonreset interrupts are disabled when NMIE = 0. All maskable interrupts are disabled when GIE = 0.
In Figure 6-4, IFm is set during CPU cycle 6. You could attempt to clear bit IFm by
using an MVC instruction to write a 1 to bit m of ICR in execute packet n + 3 (during
CPU cycle 4). However, in this case, the automated write by the interrupt detection
logic takes precedence and IFm remains set.
Figure 6-4 assumes INTm is the highest priority pending interrupt and is enabled by
the GIE and NMIE bits, as necessary. If it is not the highest priority pending interrupt,
IFm remains set until either you clear it by writing a 1 to bit m of ICR, or the processing
of INTm occurs.
Any pending interrupt will be taken as soon as pending branches are completed.
Figure 6-5 Return from Interrupt Execution and Processing: Pipeline Operation
[Pipeline timing diagram over CPU cycles 0-17. Signal traces: EXC, TSR.GIE, TSR.EXC, and ITSR.SPLX (1 if SPLOOP was interrupted; sampled for return target in SPLOOP state machine). Pipeline rows: execute packet n, B IRP, n+2 through n+6, the IRP target, and t+1.]
The current execution mode is held in a piped series of register bits allowing a change
in the mode to progress from the PS phase through the E1 phase. Fetches from program
memory use the PS-valid register which is only loaded at the start of a transfer of
control. This value is an output on the program memory interface and is shown in the
timing diagram as PCXM. As the target execute packet progresses through the pipeline,
the new mode is registered for that stage. Each stage uses its registered version of the
execution mode. The field in TSR is the E1-valid version of CXM. It always indicates
the execution mode for the instructions executing in E1. The mode is used in the data
memory interface, and is registered for all load/store instructions when they execute in
E1. This is shown in the timing diagram as DCXM. Note that neither PCXM nor
DCXM is visible in any register to you.
Figure 6-6 CPU Nonmaskable Interrupt Detection and Processing: Pipeline Operation
[Pipeline timing diagram over CPU cycles 0-22. Signal traces: NMI at CPU boundary, NMIF, EXC, IER.NMIE, TSR.GEE, TSR.INT, TSR.EXC, and the copy of TSR to NTSR. Pipeline rows: execute packets n through n+11 (n+2 contains no branch; annulled instructions are marked) and the ISFP. Cycles 6-14: nonreset interrupt processing is disabled (A).]
(A) After this point, interrupts are still disabled. All nonreset interrupts are disabled when NMIE = 0. All maskable interrupts are disabled when GIE = 0.
Figure 6-7 CPU Return from Nonmaskable Interrupt Execution and Processing: Pipeline Operation
[Pipeline timing diagram over CPU cycles 0-17. Signal traces: EXC, TSR.XEN, and TSR.EXC. Pipeline rows: execute packet n, B NRP, n+2 through n+6, the row labeled "IRP target", and t+1.]
Note that a nested exception can force an internally-generated reset that does not reset
all the registers to their hardware reset state. See ‘‘Nested Exceptions’’ on page 7-12 for
more information.
During CPU cycles 15-21 of Figure 6-8, the following reset processing actions occur:
• Processing of subsequent nonreset interrupts is disabled because the GIE and
NMIE bits are cleared.
• A branch to the address held in ISTP (the pointer to the ISFP for INT0) is forced
into the E1 phase of the pipeline during cycle 16.
• IF0 is cleared during cycle 17.
Note—Code that starts running after reset must explicitly enable the GIE bit,
the NMIE bit, and IER to allow interrupts to be processed.
[Figure 6-8: pipeline timing diagram for reset processing over CPU cycles 0-22. Pipeline rows: execute packets n through n+7 (pipeline flush) and the Reset ISFP. Cycles 15-21: nonreset interrupt processing is disabled (B).]
(A) IF0 is set on the next CPU cycle boundary after a 4-clock cycle delay after the rising edge of RESET.
(B) After this point, interrupts are still disabled. All nonreset interrupts are disabled when NMIE = 0. All maskable interrupts are disabled when GIE = 0.
6.5 Performance Considerations
6.6 Programming Considerations
To avoid unpredictable operation, you must employ the single assignment method in
code that can be interrupted. When an interrupt occurs, all instructions entering E1
prior to the beginning of interrupt processing are allowed to complete execution
(through E5). All other instructions are annulled and refetched upon return from
interrupt. The instructions encountered after the return from the interrupt do not
experience any delay slots from the instructions prior to processing the interrupt. Thus,
instructions with delay slots prior to the interrupt can appear, to the instructions after
the interrupt, to have fewer delay slots than they actually have.
Example 6-14 shows a code fragment that stores two variables into A1 using multiple
assignment. Example 6-15 shows equivalent code using the single assignment
programming method, which stores the two variables into two different registers.
For example, before reaching the code in Example 6-14, suppose that register A1
contains 0 and register A0 points to a memory location containing a value of 10. The
ADD instruction, which is in a delay slot of the LDW, sums A2 with the value in A1 (0)
and the result in A3 is just a copy of A2. If an interrupt occurred between the LDW and
ADD, the LDW would complete the update of A1 (10), the interrupt would be
processed, and the ADD would sum A1 (10) with A2 and place the result in A3 (equal
to A2 + 10). Obviously, this situation produces incorrect results.
In Example 6-15, the single assignment method is used. The register A1 is assigned only
to the ADD input and not to the result of the LDW. Regardless of the value of A6 with
or without an interrupt, A1 does not change before it is summed with A2. Result A3 is
equal to A2.
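The hazard can be reproduced with a toy model of the two code fragments. This Python sketch is not a pipeline simulator. It captures one fact from the discussion above: without an interrupt, the ADD in the load delay slot reads its inputs before the load result arrives, and with an interrupt, the load completes first. The function name and the value chosen for A2 are hypothetical.

```python
def run_fragment(multiple_assignment, interrupted, a2=5):
    """Return the value of A3 produced by the ADD in the LDW delay slot."""
    regs = {"A1": 0, "A2": a2, "A6": 0}
    loaded_value = 10  # contents of the memory word that A0 points to
    # Multiple assignment loads into A1; single assignment loads into A6.
    load_dest = "A1" if multiple_assignment else "A6"

    if interrupted:
        # The LDW finishes while the interrupt is serviced, so the ADD
        # runs after the loaded value has reached its destination.
        regs[load_dest] = loaded_value
        a3 = regs["A1"] + regs["A2"]
    else:
        # The ADD executes in a delay slot and reads the old registers.
        a3 = regs["A1"] + regs["A2"]
        regs[load_dest] = loaded_value
    return a3


if __name__ == "__main__":
    for multiple in (True, False):
        for irq in (False, True):
            print(multiple, irq, run_fragment(multiple, irq))
```

Only the multiple assignment fragment gives a result that depends on whether an interrupt occurred.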
Also, there may be times when you want to allow an interrupt service routine to be
interrupted by another (particularly higher priority) interrupt. Even though the
processor by default does not allow interrupt service routines to be interrupted unless
the source is an NMI, it is possible to nest interrupts under software control. To allow
nested interrupts, the interrupt service routine must perform the following initial steps
in addition to its normal work of saving any registers (including control registers) that
it modifies:
1. The contents of IRP (or NRP) must be saved
2. The contents of the PGIE bit must be saved
3. The contents of ITSR must be saved
4. The GIE bit must be set to 1
Prior to returning from the interrupt service routine, the code must restore the registers
saved above as follows:
1. The GIE bit must be first cleared to 0
2. The PGIE bit saved value must be restored
3. The contents of ITSR must be restored
4. The IRP (or NRP) saved value must be restored
Although steps 2, 3, and 4 above may be performed in any order, it is important that
the GIE bit is cleared first. This means that the GIE and PGIE bits must be restored with
separate writes to CSR. If these bits are not restored separately, then it is possible that
the PGIE bit is overwritten by nested interrupt processing just as interrupts are being
disabled.
Note—When coding nested interrupts for the CPU, the ITSR should be saved
and restored to prevent corruption by the nested interrupt.
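The save and restore ordering above can be sketched as a behavioral model. The Python code below is illustrative only: the class, the function, and the numeric values are hypothetical, and the model tracks only IRP, ITSR, GIE, and PGIE.

```python
class CpuModel:
    def __init__(self):
        self.gie, self.pgie, self.irp, self.itsr = 1, 0, 0, 0

    def take_interrupt(self, return_address, tsr_snapshot):
        # Hardware actions on interrupt entry; these overwrite IRP, ITSR, PGIE.
        self.irp = return_address
        self.itsr = tsr_snapshot
        self.pgie = self.gie
        self.gie = 0

    def b_irp(self):
        # Hardware action on return from interrupt.
        self.gie = self.pgie


def nestable_isr(cpu, body):
    """Run body(cpu) with nesting enabled, following the steps above."""
    saved = (cpu.irp, cpu.pgie, cpu.itsr)  # initial steps 1-3: save state
    cpu.gie = 1                            # initial step 4: allow nesting
    body(cpu)                              # may be interrupted
    cpu.gie = 0                            # restore step 1: clear GIE first
    cpu.irp, cpu.pgie, cpu.itsr = saved    # restore steps 2-4
```

If a second interrupt arrives while the body runs, it overwrites IRP and ITSR. The values saved at entry are what let the outer routine still return to the original code.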
The code sequence begins by copying the address of the highest priority interrupt from
the ISTP to the register B2. The next instruction extracts the number of the interrupt,
which is used later to clear the interrupt. The branch to the interrupt service routine
comes next with a parallel instruction to set up the ICR word.
The last five instructions fill the delay slots of the branch. First, the 32-bit return
address is stored in the B2 register and then copied to the interrupt return pointer
(IRP). Finally, the number of the highest priority interrupt, stored in B1, is used to shift
the ICR word in B1 to clear the interrupt.
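The arithmetic behind this sequence can be shown separately from the assembly. The Python sketch below models only the data flow. The function name is hypothetical, the HPEINT position (bits 9-5 of ISTP) is assumed from the register description, and treating an HPEINT of 0 as "nothing to service" is also an assumption.

```python
def manual_dispatch(istp, ifr):
    """Return (branch_target, new_ifr) for a polled interrupt, or (None, ifr)."""
    number = (istp >> 5) & 0x1F   # assumed HPEINT field, bits 9-5
    if number == 0:
        return None, ifr          # no pending, enabled interrupt
    icr_word = 1 << number        # a 1 shifted to the flag position
    return istp, ifr & ~icr_word  # branch to the ISFP and clear the flag


if __name__ == "__main__":
    target, ifr = manual_dispatch(0x000000C0, 0x0040)  # INT6 pending
    print(hex(target), hex(ifr))
```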
6.6.4 Traps
A trap behaves like an interrupt, but is created and controlled with software. The trap
condition can be stored in any one of the conditional registers: A1, A2, B0, B1, or B2. If
the trap condition is valid, a branch to the trap handler routine processes the trap and
the return.
Example 6-17 and Example 6-18 show a trap call and the return code sequence,
respectively. In the first code sequence, the address of the trap handler code is loaded
into register B0 and the branch is called. In the delay slots of the branch, the context is
saved in the B0 register, the GIE bit is cleared to disable maskable interrupts, and the
return pointer is stored in the B1 register.
The trap is processed with the code located at the address pointed to by the label
TRAP_HANDLER. If the B0 or B1 registers are needed in the trap handler, their
contents must be stored to memory and restored before returning. The code shown in
Example 6-18 should be included at the end of the trap handler code to restore the
context prior to the trap and return to the TRAP_RETURN address.
B B1 ; return
MVC B0,CSR ; restore CSR
NOP 4 ; delay slots
Often traps are used to handle unexpected conditions in the execution of the code. The
CPU provides explicit exception handling support which may be used for this purpose.
Chapter 7
CPU Exceptions
This chapter describes exceptions on the CPU. It details the related CPU control
registers and their functions in controlling exceptions. It also describes exception
processing, the method the CPU uses to detect automatically the presence of exceptions
and divert program execution flow to your exception service code. Finally, the chapter
describes the programming implications of exceptions.
7.1 Overview
The exception mechanism on the CPU is intended to support error detection and
program redirection to error handling service routines. Error signals generated outside
of the CPU are consolidated to one exception input to the CPU. Exceptions generated
within the CPU are consolidated to one internal exception flag with information as to
the cause in a register. Fatal errors detected outside of the CPU are consolidated and
incorporated into the NMI input to the CPU.
Check the device-specific data manual for your external exception specifications.
For NMI processing to occur, the nonmaskable interrupt enable (NMIE) bit in the
interrupt enable register (IER) must be set to 1. If the NMIE bit is set to 1, the only
condition that can prevent NMI processing is the CPU being stalled.
The NMIE bit is cleared to 0 at reset to prevent interruption of the reset processing. It
is cleared at the occurrence of an NMI to prevent another NMI from being processed.
You cannot manually clear NMIE, but you can set NMIE to allow nested NMIs. While
NMIE is cleared, all external exceptions are disabled. Internal exceptions are not
affected by NMIE.
When NMI is recognized as pending, the NMI exception flag (NXF) bit in the
exception flag register (EFR) is set. Unlike the NMIF bit in the interrupt flag register
(IFR), the NXF bit is not cleared automatically upon servicing of the NMI. The NXF bit
remains set until manually cleared in the exception service routine.
Transitions on the NMI input while the NXF bit is set are ignored. In the event an
attempt to clear the flag using the MVC instruction coincides with the automated write
by the exception detection logic, the automatic write takes precedence and the NXF bit
remains set.
When EXCEP is recognized as pending, the external exception flag (EXF) bit in EFR is
set. The EXF bit remains set until manually cleared in the exception service routine.
Instructions that have already entered E1 before a context switch begins are allowed to
complete. Any internal exceptions generated by these completing instructions are
ignored. This is true for both interrupt and exception context switches.
When an internal exception is recognized as pending, the internal exception flag (IXF)
bit in EFR is set. The IXF bit remains set until manually cleared in the exception service
routine.
In general, the exception service routine for an exception is too large to fit in a single
fetch packet, so a branch to the location of additional exception service routine code is
required. This is shown in Figure 7-1.
7.2 Exception Control
Figure 7-1 Interrupt Service Table With Branch to Additional Exception Service Code Located Outside the IST

IST
Address      Contents
xxxx 000h    RESET ISFP
xxxx 020h    NMI ISFP
xxxx 040h    Reserved
xxxx 060h    Reserved
xxxx 080h    INT4 ISFP
xxxx 0A0h    INT5 ISFP
xxxx 0C0h    INT6 ISFP
xxxx 0E0h    INT7 ISFP
xxxx 100h    INT8 ISFP
xxxx 120h    INT9 ISFP
xxxx 140h    INT10 ISFP
xxxx 160h    INT11 ISFP
xxxx 180h    INT12 ISFP
xxxx 1A0h    INT13 ISFP
xxxx 1C0h    INT14 ISFP
xxxx 1E0h    INT15 ISFP

ISFP for exceptions
Address      Contents
020h         Instr1
024h         Instr2
028h         B 1234h
02Ch         Instr4
030h         Instr5
034h         Instr6
038h         Instr7
03Ch         Instr8

Additional ISFP for NMI
Address      Contents
1220h        -
1224h        -
1228h        -
122Ch        -
1230h        -
1234h        Instr9
1238h        B NRP
123Ch        Instr11
1240h        Instr12
1244h        Instr13
1248h        Instr14
124Ch        Instr15

The exception service routine includes this instruction extension of the exception ISFP.
External exceptions are also qualified by the NMIE bit in IER. An external exception
(EXCEP or NMI) can trigger exception processing only if this bit is set. Internal
exceptions are not affected by NMIE. The IER is shown in Figure 2-7 and described in
Table 2-12 on page 2-19. The EXCEP exception input can also be disabled by clearing
the XEN bit in TSR.
When NMIE = 0, all interrupts and external exceptions are disabled, preventing
interruption of an exception service routine. The NMIE bit is cleared at reset to prevent
any interruption of processor initialization until you enable exceptions. After reset, you
must set the NMIE bit to enable external exceptions and to allow INT15-INT4 to be
enabled by the GIE bit and the appropriate IER bit. You cannot manually clear the
NMIE bit; the NMIE bit is unaffected by a write of 0. The NMIE bit is also cleared by
the occurrence of an NMI. If cleared, the NMIE bit is set only by completing a B NRP
instruction or by a write of 1 to NMIE.
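The rules for the NMIE bit form a small state machine. The Python sketch below restates them as a model with hypothetical names. It covers only the NMIE bit and does not check whether an NMI is currently allowed to occur.

```python
class NmieModel:
    def __init__(self):
        self.nmie = 0       # cleared at reset

    def reset(self):
        self.nmie = 0

    def write(self, bit):
        # A write of 1 sets NMIE; a write of 0 leaves it unchanged.
        if bit == 1:
            self.nmie = 1

    def take_nmi(self):
        self.nmie = 0       # cleared by the occurrence of an NMI

    def b_nrp(self):
        self.nmie = 1       # set again when B NRP completes


if __name__ == "__main__":
    m = NmieModel()
    m.write(1)   # initialization code enables NMI and external exceptions
    m.write(0)   # ignored: NMIE cannot be cleared manually
    print(m.nmie)
```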
Execution of a B NRP instruction causes the saved context in NTSR to be loaded into
TSR to resume execution. Similarly, a B IRP instruction restores context from ITSR
into TSR.
Information about the CPU context at the point of an exception is retained in NTSR.
Table 7-2 shows the behavior for each bit in NTSR. The information in NTSR is used
upon execution of a B NRP instruction to restore the CPU context before resuming the
interrupted instruction execution. The HWE bit in NTSR is set when an internal or
external exception is taken. The HWE bit is cleared by the SWE and SWENR
instructions. The NTSR is shown in Figure 2-21 and described in Table 2-19 on
page 2-29.
Table 7-2 NTSR Field Behavior When an Exception is Taken
Bit   Field  Action
0     GIE    GIE bit in TSR at point exception is taken.
1     SGIE   SGIE bit in TSR at point exception is taken.
2     GEE    GEE bit in TSR at point exception is taken (must be 1).
3     XEN    XEN bit in TSR at point exception is taken.
7-6   CXM    CXM bits in TSR at point exception is taken.
9     INT    INT bit in TSR at point exception is taken.
10    EXC    EXC bit in TSR at point exception is taken (must be 0).
14    SPLX   Terminated an SPLOOP.
15    IB     Exception occurred while interrupts were blocked.
16    HWE    Hardware exception taken (NMI, EXCEP, or internal).
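As an illustration of Table 7-2, the following C sketch builds an NTSR value from a TSR value and three status flags. The bit positions come from the table; the function name, the flag parameters, and the treatment of bits not listed in the table (left as 0) are assumptions of this example.

```c
#include <stdint.h>

/* Bit positions from Table 7-2. */
#define TSR_GIE   (1u << 0)
#define TSR_SGIE  (1u << 1)
#define TSR_GEE   (1u << 2)
#define TSR_XEN   (1u << 3)
#define TSR_CXM   (3u << 6)
#define TSR_INT   (1u << 9)
#define TSR_EXC   (1u << 10)
#define NTSR_SPLX (1u << 14)
#define NTSR_IB   (1u << 15)
#define NTSR_HWE  (1u << 16)

/* Fields that NTSR takes directly from TSR at the point of the exception. */
#define NTSR_COPY_MASK (TSR_GIE | TSR_SGIE | TSR_GEE | TSR_XEN | \
                        TSR_CXM | TSR_INT | TSR_EXC)

/* Hypothetical helper: compose NTSR as described by Table 7-2. */
uint32_t ntsr_on_exception(uint32_t tsr, int sploop_terminated,
                           int interrupts_blocked, int hardware)
{
    uint32_t ntsr = tsr & NTSR_COPY_MASK;       /* copied TSR fields */
    if (sploop_terminated)  ntsr |= NTSR_SPLX;  /* an SPLOOP was terminated */
    if (interrupts_blocked) ntsr |= NTSR_IB;    /* taken while blocked */
    if (hardware)           ntsr |= NTSR_HWE;   /* NMI, EXCEP, or internal */
    return ntsr;
}
```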
7.3 Exception Detection and Processing
It is not always possible to safely exit the exception handling routine. Conditions that
can prevent a safe return from exceptions include:
• SPLOOPs that are terminated by an exception cannot be resumed correctly. The
SPLX bit in NTSR should be verified to be 0 before returning.
• Exceptions that occur when interrupts are blocked cannot be resumed correctly.
The IB bit in NTSR should be verified to be 0 before returning.
• Exceptions that occur at any point in the code that cannot be interrupted safely
(for example, a tight loop containing multiple assignments) cannot be safely
returned to. The compiler will normally disable interrupts at these points in the
program; verify that the GIE bit in NTSR is 1 before returning to confirm that the
exception did not occur at such a point.
If the exception cannot be safely returned from, the appropriate response will be
different based on the specific cause of the exception. In some cases, a warm reset will
be required. In other cases, restarting a user task may be sufficient.
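The three checks listed above can be combined into one test on the saved NTSR value. The following is a minimal sketch: the bit positions are taken from Table 7-2, while the function name is hypothetical and the way the handler obtains the NTSR value is left out because it depends on the toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

#define NTSR_GIE  (1u << 0)   /* GIE at the point of the exception */
#define NTSR_SPLX (1u << 14)  /* exception terminated an SPLOOP */
#define NTSR_IB   (1u << 15)  /* exception taken while interrupts were blocked */

/* Returns true only if none of the listed unsafe conditions is flagged. */
bool exception_return_is_safe(uint32_t ntsr)
{
    if (ntsr & NTSR_SPLX)
        return false;          /* a terminated SPLOOP cannot be resumed */
    if (ntsr & NTSR_IB)
        return false;          /* interrupts were blocked */
    if (!(ntsr & NTSR_GIE))
        return false;          /* code had disabled interrupts */
    return true;
}
```

When the function returns false, the handler would choose between the responses described above, such as restarting a user task or requesting a warm reset.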
The NRP contains the 32-bit address of the first execute packet in the program flow that
was not executed because of an exception. Although you can write a value to this
register, any subsequent exception processing may overwrite that value. The NRP is
shown in Figure 2-12 on page 2-23.
Figure 7-2 assumes EXCEP is enabled by the XEN, NMIE, and GEE bits, as necessary.
Figure 7-2 External Exception (EXCEP) Detection and Processing: Pipeline Operation
(Timing diagram not reproduced. It plots EXCEP at the CPU boundary, EFR.EXF, EXC, IER.NMIE, TSR.XEN, TSR.EXC, and the copy of TSR to NTSR against CPU cycles 0-22 for execute packets n through n+11, with an Annulled Instructions region among the later packets, and for the ISFP. Annotation: cycles 6-14, nonreset interrupt processing is disabled.)
• NMIE = 1
• GEE = 1
• For EXCEP, XEN = 1
Any pending exception will be taken as soon as any stalls are completed.
When control is transferred to the interrupt processing sequence, the context needed to
return from the ISR is saved in ITSR. TSR is set for the default interrupt processing
context. Table 7-3 shows the behavior for each bit in TSR. Figure 7-2 shows the timing
of the changes to the TSR bits as well as the CPU outputs used in exception processing.
Fetches from program memory use the PS-valid register that is only loaded at the start
of a context switch. This value is an output on the program memory interface and is
shown in the timing diagram as PCXM. As the target execute packet progresses
through the pipeline, the new mode is registered for that stage. Each stage uses its
registered version of the execution mode. The field in TSR is the E1-valid version of
CXM. It always indicates the execution mode for the instructions executing in E1. The
mode is used in the data memory interface, and is registered for all load/store
instructions when they execute in E1. This is shown in the timing diagram as DCXM.
Figure 7-3 shows the transitions in the case of a return from exception initiated by
executing a B NRP instruction.
(Figure 7-3 timing diagram not reproduced. It plots EXC, IER.NMIE, and TSR.EXC against CPU cycles 0-17 for execute packets n, B NRP, n+2 through n+6, the NRP target, and t+1.)
7.4 Performance Considerations
The NTSR, ITSR, IRP, and NRP can be tested in the user's boot code to determine
whether a reset was initiated by the reset pin or caused by a nested exception.
(Timing diagram not reproduced. It plots NMI at the CPU boundary, EFR.NXF, EXC, IER.NMIE, TSR.XEN, TSR.EXC, and the copy of TSR to NTSR against CPU cycles 0-22 for execute packets n through n+11, with an Annulled Instructions region among the later packets, and for the ISFP. Annotation: cycles 6-14, nonreset interrupt processing is disabled.)
(Timing diagram not reproduced. It plots EXCEP at the CPU boundary, EFR.EXF, NMI at the CPU boundary, EFR.NXF, EXC, IER.NMIE, TSR.XEN, and TSR.EXC against CPU cycles 0-22 for execute packets n through n+11 and ISR through ISR+5, each group with an Annulled Instructions region. Labels in the diagram also mark TSR, NTSR, ITSR, IRP, NRP, &(ISR), and &(n+5).)
7.5 Programming Considerations
The SWENR instruction causes a change in control to the address contained in REP. REP
should have been previously initialized to a correct value by a privileged supervisor
mode process.
Chapter 8
This chapter describes the software pipelined loop (SPLOOP) buffer hardware and
software mechanisms.
8.1 Software Pipelining
Chapter 8—Software Pipelined Loop (SPLOOP) Buffer www.ti.com
The instruction schedule for a modulo scheduled loop has three components: a kernel,
a prolog, and an epilog (Figure 8-1). The kernel is the instruction schedule that executes
the pipeline steady state. The prolog and epilog are the instruction schedules that set up
and drain the execution of the loop kernel. In Figure 8-1, the steady state has four
stages, each from a different iteration, executing in parallel. A single iteration produces
a result in the time it takes four stages to complete, but in the steady state of the software
pipeline, a result is available every stage (that is, every ii cycles).
The first prolog stage, P0, is equal to the first loop stage, S0. Each prolog stage, Pn
(where n > 0), is made up of the loop stage, Sn, plus all the loop stages in the previous
prolog stage, Pn - 1. The kernel includes all the loop stages. The first epilog stage, E0, is
made up of the kernel stage minus the first loop stage, S0. Each epilog stage, En (where
n > 0), is made up of the previous epilog stage, En - 1, minus the loop stage, Sn.
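The composition rules above can be written as bit masks, one bit per loop stage. This is an illustrative sketch only; the function names and the mask representation (bit s set when loop stage Ss is active, at most 32 stages) are assumptions of this example.

```c
#include <stdint.h>

/* Kernel: all loop stages S0 .. S(stages-1) are active. */
uint32_t kernel_mask(unsigned stages)
{
    return (stages >= 32u) ? 0xFFFFFFFFu : ((1u << stages) - 1u);
}

/* Prolog stage Pn: loop stage Sn plus everything in P(n-1), that is S0 .. Sn. */
uint32_t prolog_mask(unsigned n)
{
    return kernel_mask(n + 1u);
}

/* Epilog stage En: the kernel minus S0 .. Sn, that is S(n+1) .. last stage. */
uint32_t epilog_mask(unsigned stages, unsigned n)
{
    return kernel_mask(stages) & ~kernel_mask(n + 1u);
}
```

For the four-stage loop of Figure 8-1, prolog_mask(2) selects S0 through S2 and epilog_mask(4, 0) selects S1 through S3.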
The dynamic length (dynlen) of the loop is the number of instruction cycles required
for one iteration of the loop to complete. The length of the prolog is (dynlen - ii). The
length of the epilog is the same as the length of the prolog.
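The length relationships can be checked with a few lines of C. The prolog length is taken from the text; the total cycle count is a derived formula (iteration k starts k * ii cycles after the first and runs for dynlen cycles) that ignores stalls and loop setup. The function names are hypothetical.

```c
/* Prolog length in cycles; the epilog has the same length.
   Assumes dynlen >= ii. */
unsigned prolog_cycles(unsigned dynlen, unsigned ii)
{
    return dynlen - ii;
}

/* Cycles needed to complete n iterations (n >= 1): the first iteration
   takes dynlen cycles, each later one finishes ii cycles after the previous. */
unsigned total_cycles(unsigned dynlen, unsigned ii, unsigned n)
{
    return dynlen + (n - 1u) * ii;
}
```

For example, a loop with a seven-cycle iteration and ii = 1, as in Table 8-1, completes eight iterations in 14 cycles under this formula, which matches the 14 rows of that table.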
Figure 8-1 Software Pipelined Execution Flow
(Diagram not reproduced. It shows the execution flow of iterations 0 through n-1 beside the code layout of the loop stages.)
8.2 Software Pipelining
8.3 Terminology
The following terminology is used in the discussion in this chapter.
• Iteration interval (ii) is the interval (in instruction cycles) between successive
iterations of the loop.
• A stage is the code executed in one iteration interval.
• Dynamic length (dynlen) is the length (in instruction cycles) of a single iteration
of the loop. It is therefore equal to the number of stages times the iteration
interval.
• The kernel is the period when the loop is executing in a steady state with the
maximum number of loop iterations executing simultaneously. For example, in
Figure 8-1 the kernel is the set of instructions contained in stage 0, stage 1,
stage 2, and stage 3.
• The prolog is the period before the loop reaches the kernel in which the loop is
winding up. The length of the prolog is the dynamic length minus the
iteration interval (dynlen - ii).
• The epilog is the period after the loop leaves the kernel in which the loop is
winding down. The length of the epilog is the dynamic length minus the
iteration interval (dynlen - ii).
8.4 SPLOOP Hardware Support
There are two LBCs to support overlapped nested loops. LBC is not a user-visible
register.
There is a 4 cycle latency between when ILC is loaded and when its contents are
available for use. When used with the SPLOOP instruction, it should be loaded 4 cycles
before the SPLOOP instruction is encountered. ILC must be loaded explicitly using the
MVC instruction.
8.4.5 Task State Register (TSR), Interrupt Task State Register (ITSR), and NMI/Exception
Task State Register (NTSR)
The SPLX bit in the task state register (TSR) indicates whether an SPLOOP is currently
executing or not executing.
When an interrupt occurs, the contents of TSR (including the SPLX bit) are copied to the
interrupt task state register (ITSR).
See section 2.9.14 on page 2-33 for more information on TSR. See section 2.9.8 on
page 2-28 for more information on ITSR. See section 2.9.9 on page 2-29 for more information
on NTSR.
8.5 SPLOOP-Related Instructions
When you know in advance the number of iterations that the loop will execute, you can
use the SPLOOP or SPLOOPD instructions. If you do not know the exact number of
iterations that the loop should execute, you can use the SPLOOPW in a fashion similar
to a do−while loop.
The SPLOOP(D/W) instructions each clear the loop buffer count register (LBC), load
the iteration interval (ii), and start the LBC counting.
The ii parameter is the iteration interval that specifies the interval (in instruction
cycles) between successive iterations of the loop.
The SPLOOP instruction is used when the number of loop iterations is known in
advance. The number of loop iterations is determined by the value loaded to the inner
loop count register (ILC). ILC should be loaded with an initial value 4 cycles before the
SPLOOP instruction is encountered.
The (optional) conditional predication is used to indicate when and if a nested loop
should be reloaded. The contents of the reload inner loop counter (RILC) is copied to
ILC when either a SPKERNELR or a SPMASKR instruction is executed with the
predication condition on the SPLOOP instruction true. If the loop is not nested, then
the conditional predication should not be used.
The ii parameter is the iteration interval which specifies the interval (in instruction
cycles) between successive iterations of the loop.
The SPLOOPD instruction is used to initiate a loop buffer operation when the known
minimum iteration count of the loop is great enough that the inner loop count register
(ILC) can be loaded in parallel with the SPLOOPD instruction and the 4 cycle latency
will have passed before the last iteration of the loop.
Unlike the SPLOOP instruction, the load of ILC is performed in parallel with the
SPLOOPD instruction. Due to the inherent latency of the load to ILC, the value to ILC
should be predecremented to account for the 4-cycle latency. The amount of the
predecrement is given in Table 8-4.
The (optional) conditional predication is used to indicate when and if a nested loop
should be reloaded. The contents of the reload inner loop counter (RILC) is copied to
ILC when either a SPKERNELR or a SPMASKR instruction is executed with the
predication condition on the SPLOOP instruction true. If the loop is not nested, then
the conditional predication should not be used.
The use of the SPLOOPD instruction can result in reducing the time spent in setting
up the loop by eliminating up to 4 cycles that would otherwise be spent in setting up
ILC. The trade-off is that the SPLOOPD instruction cannot be used if the loop is not
long enough to accommodate the 4 cycle delay.
The ii parameter is the iteration interval which specifies the interval (in instruction
cycles) between successive iterations of the loop.
The SPLOOPW instruction is used to initiate a loop buffer operation when the total
number of loops required is not known in advance. The SPLOOPW instruction must
be predicated. The loop terminates if the predication condition is true. The value in the
inner loop count register (ILC) is not used to determine the number of loops.
When the SPLOOPW instruction is used to initiate a loop buffer operation, the epilog
is skipped when the loop terminates.
The SPKERNEL(R) instruction also controls the point in the epilog at which the execution
of post-SPLOOP instructions begins.
In each case, the SPKERNEL(R) instruction must be the first instruction in an execute
packet and cannot be placed in the same execute packet as any instruction that initiates
multicycle NOPs.
The (optional) fstg and fcyc parameters specify the delay interval between the
SPKERNEL instruction and the start of the post epilog code. The fstg specifies the
number of complete stages and the fcyc specifies the number of cycles in the last stage
in the delay.
The SPKERNEL instruction has arguments that instruct the SPLOOP hardware to
begin execution of post-SPLOOP instructions after a specified delay (in stages and
cycles) from the start of the epilog.
Note that the post-epilog instructions are fetched from program memory and overlaid
with the epilog instructions fetched from the SPLOOP buffer. Functional unit conflicts
can be avoided by either coding for a sufficient delay using the SPKERNEL instruction
arguments or by using the SPMASK instruction to inhibit the operation of instructions
from the buffer that might conflict with the instructions from the epilog.
If a reload is required with a delay between the SPKERNEL and the point of reload
(that is, nonperfect overlap) use the SPMASKR instruction with the SPKERNEL (not
SPKERNELR) to indicate the point of reload.
The SPKERNELR instruction cannot be used in the same SPLOOP operation as the
SPMASKR instruction.
The SPMASKR instruction cannot be used in the same SPLOOP operation as the
SPKERNELR instruction.
SPMASKR (unitmask)
The unitmask parameter specifies which functional units are masked by the SPMASK
or SPMASKR instruction. The units may alternatively be specified by marking the
instructions with a caret (^) symbol. The following two forms are equivalent and will
each mask the .D1 unit. Example 8-1 and Example 8-2 show the two ways of specifying
the masked instructions.
8.6 Basic SPLOOP Example
Example 8-5 is an alternate implementation of the same loop using the SPLOOPD
instruction. The load of the inner loop count register (ILC) can be made in the same
cycle as the SPLOOPD instruction, but due to the inherent delay between loading the
ILC and its use, the value needs to be predecremented to account for the 4 cycle delay.
Table 8-1 SPLOOP Instruction Flow for Example 8-4 and Example 8-5
        Loop
Cycle   1     2     3     4     5     6     7     8
1       LDW
2       NOP   LDW
3       NOP   NOP   LDW
4       NOP   NOP   NOP   LDW
5       NOP   NOP   NOP   NOP   LDW
6       MV    NOP   NOP   NOP   NOP   LDW
7       STW   MV    NOP   NOP   NOP   NOP   LDW
8             STW   MV    NOP   NOP   NOP   NOP   LDW
9                   STW   MV    NOP   NOP   NOP   NOP
10                        STW   MV    NOP   NOP   NOP
11                              STW   MV    NOP   NOP
12                                    STW   MV    NOP
13                                          STW   MV
14                                                STW
do {
    i--;
    dest[i] = source[i];
} while (i);
8.7 Loop Buffer
The SPLOOP body is a single, scheduled iteration of the loop. It consists of one or more
stages of ii cycles each. The execution of the prolog, kernel, and epilog are generated
from copies of this single iteration time shifted by multiples of ii cycles and overlapped
for simultaneous execution. The final stage may contain fewer than ii cycles, omitting
the final cycles if they have only NOP instructions.
The dynamic length (dynlen) is the length of the SPLOOP body in cycles starting with
the cycle after the SPLOOP(D) instruction. The dynamic length counts both execute
packets and NOP cycles, but does not count stall cycles. The loop buffer can
accommodate a SPLOOP body of up to 48 cycles.
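A short sketch of the two properties described above: the number of stages in an SPLOOP body when the final stage may be shorter than ii cycles, and the 48-cycle capacity of the loop buffer. The function and macro names are invented for this illustration.

```c
#include <stdbool.h>

#define SPLOOP_MAX_DYNLEN 48u  /* loop buffer capacity in cycles */

/* Number of stages in the body; the last stage may hold fewer than ii cycles.
   Assumes ii >= 1. */
unsigned sploop_stage_count(unsigned dynlen, unsigned ii)
{
    return (dynlen + ii - 1u) / ii;   /* round up */
}

/* True if a body of dynlen cycles fits in the loop buffer. */
bool sploop_body_fits(unsigned dynlen)
{
    return dynlen >= 1u && dynlen <= SPLOOP_MAX_DYNLEN;
}
```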
Example 8-8 demonstrates counting of dynamic length. There are 4 cycles of NOP that
could be combined into a single NOP 4. It is split up here to be clearer about the cycle
and stage boundaries.
In Table 8-3, the instructions in the CPU pipeline are executed from program memory.
The instructions in the SPL buffer are executed from the SPLOOP buffer. At K0 for
example, stage3 is being executed from program memory and stage0, stage1, and stage2
are being executed from the SPLOOP buffer. At Kn and later, by contrast, all stages are
being executed from the SPLOOP buffer.
Table 8-3 Software Pipeline Instruction Flow Using the Loop Buffer
Execution Flow   CPU Pipeline   SPL Buffer
P0               stage0         -
P1               stage1         stage0
P2               stage2         stage0  stage1
K0               stage3         stage0  stage1  stage2
Kn               -              stage0  stage1  stage2  stage3
E0               -              -       stage1  stage2  stage3
E1               -              -       -       stage2  stage3
E2               -              -       -       -       stage3
The execution of a software pipeline prolog and the first kernel stage are implemented
by fetching valid instructions from the loop buffer and executing them in parallel with
instructions fetched from program memory. The instructions fetched from program
memory are loaded into the loop buffer and marked as valid on the next cycle. The
execution of the remaining kernel stages is implemented by exclusively fetching valid
instructions from the loop buffer. The execution of a software pipeline epilog is
implemented by draining the loop buffer by marking instructions as invalid, while
fetching the remaining valid instructions from the loop buffer.
For example, referring to Example 8-4 on page 8-9 and Table 8-1 on page 8-10, as each
instruction in loop 1 is reached in turn, it is fetched from program memory, executed
and stored in the loop buffer. When each instruction is reached in loop 2 through loop
12, it is fetched from the loop buffer and executed. As cycles 8 through 12 execute,
instructions in the loop buffer are marked as invalid so that for each cycle fewer
instructions are fetched from the loop buffer.
The loop buffer supports the execution of a nested software pipelined loop by
reenabling the instructions stored in the loop buffer. The reexecution of the software
pipeline prolog is implemented by reenabling instructions in the loop buffer (by
marking them as valid) and then fetching valid instructions from the loop buffer. The
point of reload for the nested loop is signaled by the SPKERNELR or SPMASKR
instruction.
The loop buffer also supports do−while type of constructs in which the number of
iterations is not known in advance, but is determined in the course of the execution of
the loop. In this case, the loop immediately completes after the last kernel stage without
executing the epilog.
There is one case where the SPLX bit is set to 1 when the loop buffer is idle. When
executing a B IRP instruction to return to an interrupted SPLOOP, the ITSR is copied
back into TSR in the E1 stage of the branch. The SPLX bit is set to 1 beginning in the
E2 stage of the branch, which is before the loop buffer has restarted. If the loop buffer
state machine is started in the branch delay slots of a B IRP or B NRP instruction, it
uses the SPLX bit to determine if this is a restart of an interrupted SPLOOP. The SPLX
bit is not checked if starting an SPLOOP outside the delay slots of one of these branches.
When the SPKERNEL instruction is encountered, the loop is finished loading, the
dynlen is assigned the current value of the loading counter, and program memory fetch
is disabled. If the SPKERNEL is on the last kernel stage boundary, program memory
fetch may immediately be reenabled (or effectively never disabled).
SPMASKed instructions from program memory are not stored in the loop buffer. The
BNOP <displacement> instruction does not use a functional unit and cannot be
specified by the SPMASK instruction, so this instruction is treated in the same way as
an SPMASKed instruction.
When returning to an SPLOOP(D) instruction with the SPLX bit in TSR set to 1,
SPMASKed instructions from program memory execute like a NOP. The NOP cycles
associated with ADDKPC, BNOP, or protected LD instructions that are masked, are
always executed when resuming an interrupted SPLOOP(D).
The assembler will ensure that there are no resource conflicts that would occur if the
first kernel stage were actually reached.
Instructions fetched from the loop buffer that are masked by an SPMASK instruction
are not executed. An instruction fetched from program memory may execute on the
units that were used by an SPMASKed instruction. (See ‘‘Instruction Resource
Conflicts and SPMASK Operation’’ on page 8-40).
The draining counter is used to retrace the order in which instructions were loaded into
the loop buffer. The draining counter is initialized to 0 and then incremented by 1 each
cycle. Instructions in the loop buffer are marked as invalid in the order that they were
loaded.
Instructions in the loop buffer indexed by LBC are marked as invalid if their loading
counter value (from when they were loaded into the loop buffer) is equal to the draining
counter value.
When the draining counter is equal to (dynlen - ii), draining is complete. Any
remaining valid instructions for the loop (with a loading counter > (dynlen - ii)) are all
marked as invalid.
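The draining rules can be approximated by a small predicate. This is a simplified model, not a description of the hardware: it assumes that draining starts on a stage boundary with LBC at 0 (so LBC equals the draining counter modulo ii) and that an instruction marked invalid on a cycle does not issue on that cycle. The function name is hypothetical.

```c
#include <stdbool.h>

/* True if an instruction stored when the loading counter was load_count
   still issues from the loop buffer on the cycle when the draining counter
   equals drain_count. Assumes ii >= 1 and dynlen >= ii. */
bool issues_while_draining(unsigned load_count, unsigned drain_count,
                           unsigned dynlen, unsigned ii)
{
    if (drain_count >= dynlen - ii)
        return false;                  /* draining complete: all invalid */
    if (load_count % ii != drain_count % ii)
        return false;                  /* not in the slot LBC indexes now */
    return load_count > drain_count;   /* not yet marked invalid */
}
```

With dynlen = 8 and ii = 2, an instruction loaded at count 2 (stage 1) issues during the first epilog stage and is invalid from the second onward, which agrees with the E0 and E1 rows of Table 8-3.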
If the loop is interrupt draining, then program memory fetch remains disabled until the
interrupt is taken. If the loop is normal draining, program memory fetch is enabled
after a delay specified by the SPKERNEL(R) instruction.
The reloading counter is initialized to 0 and then incremented by 1 each cycle until it
equals the dynlen. The reloading counter is used to retrace the order in which
instructions were loaded into the loop buffer.
Instructions in the loop buffer indexed by LBC are marked as valid, if their loading
counter value (from when they were written into the loop buffer) is equal to the
reloading counter value.
Reloading does not have to start on a stage boundary. Reloading and draining may
access different offsets in the loop buffer. Therefore, there are two LBCs. When reload
begins, the unused LBC (the one not being used for draining) is allocated for reloading.
When the reloading counter is equal to the dynlen, the reloading of the software
pipeline loop is complete, all the original loop instructions have been reenabled, and
the reloading counter stops incrementing.
8.8 Execution Patterns
Program memory fetch of the epilog is disabled when the reload counter equals the
dynlen or after the last delay slot of a branch that executed with a true condition. In
general, the branch is used in a nested loop to place the PC back at the address of the
execute packet after the SPKERNEL(R) instruction to reuse the same epilog code
between each execution of the inner loop.
A hardware exception is raised while reloading if the termination condition is true and
the draining counter for the previous invocation of the loop has not reached the value
of dynlen - ii. This describes a condition where both invocations of the loop are
attempting to drain at the same time (this could happen, for example, if the RILC value
was smaller than the ILC value).
In Figure 8-3, the termination condition is true on the first kernel stage boundary, K0,
so the loop falls through to the epilog and the software pipeline executes only a single
kernel stage.
Figure 8-2 General Prolog, Kernel, and Epilog Execution Pattern
(Diagram not reproduced. It lists the execution pattern P0, P1, P2, K0, Kn, E0, E1, E2 beside the loop buffer operations Load, Fetch, and Drain.)
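(Figure 8-3 diagram not reproduced. It lists the execution pattern P0, P1, P2, K0, E0, E1, E2 beside the loop buffer operations Load, Fetch, and Drain.)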
If the termination condition is encountered on the first stage boundary (end of P0) as
in Figure 8-5, then no instructions actually execute from the loop buffer. In this special
case of early-exit, the loop is only executing a single iteration.
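Figure 8-4 Early-Exit Execution Pattern
(Diagram not reproduced. It lists the execution pattern P0, P1, P2, K0 with epilog stage E0 alongside P2 and E1 alongside K0, followed by E2, beside the loop buffer operations Load, Fetch, and Drain.)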
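(Figure 8-5 diagram not reproduced. It lists the execution pattern P0, P1, P2, K0 with epilog stage E0 alongside P1, E1 alongside P2, and E2 alongside K0, beside the loop buffer operations Load, Fetch, and Drain.)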
In Figure 8-6, the loop buffer begins executing a reload prolog while completing the
epilog of a previous invocation of the same loop.
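(Figure 8-6 diagram not reproduced. The first invocation runs P0, P1, P2, K0, Kn; its epilog stages E0-E2 then run alongside the reload prolog stages P0-P2 of the second invocation, which continues through K0, Kn, and its own epilog stages E0-E2. Loop buffer operations shown: Load, Fetch, and Drain.)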
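(Diagram not reproduced. The first invocation runs P0, P1, P2, K0, Kn; its epilog stages E0-E2 then run alongside the reload prolog stages P0-P2 of the second invocation. The second invocation reaches K0 together with its own E0, followed by E1 and E2. Loop buffer operations shown: Load, Fetch, and Drain.)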
8.9 Loop Buffer Control Using the Unconditional SPLOOP(D) Instruction
For the first 3 cycles of a loop initiated by an unconditional SPLOOPD instruction, the
stage boundary termination condition is always false, ILC decrement is disabled, and
the loop cannot be interrupted.
If the loop is interrupted, then after interrupt draining is complete, ILC contains the
current number of remaining loop iterations.
Example 8-9 shows a case in which the value loaded to ILC is determined at run time.
The loop may begin draining at any point whenever the ILC decrements to zero (that
is, the loop may execute 0 or more iterations). The comments in the example show the
stage number (N), the test for termination and the conditional decrement of ILC. ILC
will not decrement below zero.
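The termination test and the conditional decrement of ILC can be modeled as follows. This is a simplified sketch rather than a description of the hardware: it assumes that each test (the initial one and one per stage boundary) checks ILC for zero before decrementing it, and the function names are invented for this illustration.

```c
/* One termination test. Returns 1 if the loop stops starting new
   iterations; otherwise decrements *ilc and returns 0.
   ILC is never decremented below zero. */
int ilc_test_and_decrement(unsigned *ilc)
{
    if (*ilc == 0u)
        return 1;
    (*ilc)--;
    return 0;
}

/* Count how many iterations are started for a given initial ILC value. */
unsigned iterations_started(unsigned ilc)
{
    unsigned count = 0u;
    while (!ilc_test_and_decrement(&ilc))
        count++;               /* test was false: one more iteration starts */
    return count;
}
```

Under these assumptions an initial ILC of 0 starts no iterations, which corresponds to the "0 or more iterations" case described above.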
8.9.3 Using SPLOOPD for Loops with Known Minimum Iteration Counts
For loops with known iteration counts, the unconditional SPLOOPD instruction is
used to compensate for the 4-cycle latency to the assignment of ILC. The unconditional
SPLOOPD instruction differs from the SPLOOP instruction in the following ways:
• The initial termination condition test is always false and the initial ILC decrement
is disabled. The loop must execute at least one iteration.
• The stage boundary termination condition is forced to false, and ILC decrement
is disabled for the first 3 cycles of the loop.
• The loop cannot be interrupted for the first 3 cycles of the loop.
The SPLOOPD will test the SPLX bit in the TSR to determine if it is already set to one
(indicating a return from interrupt). In this case the SPLOOPD instruction executes
like an unconditional SPLOOP instruction.
The SPLOOPD instruction is used when the loop is known to execute for a minimum
number of loop iterations. The required minimum number of iterations is a function
of ii, as shown in Table 8-4.
Table 8-4 SPLOOPD Minimum Loop Iterations
ii      Minimum Number of Loop Iterations
1       4
2       2
3       2
≥4      1
When using the SPLOOPD instruction, ILC must be loaded with a value that is biased
to compensate for the required minimum number of loop iterations. As shown in
Example 8-10, for a loop with an ii equal to 1 that will execute 100 iterations, ILC is
loaded with 96.
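The bias can be sketched as a small calculation. The Python below is an illustrative model, not TI tooling: the function names are invented for this sketch, the minimums come from Table 8-4, and the sketch assumes the bias equals the Table 8-4 minimum, which agrees with the ii = 1 case in the text (100 iterations, ILC loaded with 96).

```python
def sploopd_min_iterations(ii):
    """Minimum loop iterations required by SPLOOPD (values from Table 8-4)."""
    if ii < 1:
        raise ValueError("ii must be at least 1")
    if ii == 1:
        return 4
    if ii in (2, 3):
        return 2
    return 1  # ii >= 4


def sploopd_ilc_value(iterations, ii):
    """Value to load into ILC, assuming the bias equals the Table 8-4 minimum."""
    minimum = sploopd_min_iterations(ii)
    if iterations < minimum:
        raise ValueError("loop runs fewer iterations than SPLOOPD requires")
    return iterations - minimum


# ii = 1 and 100 iterations, the case given in the text
print(sploopd_ilc_value(100, 1))
```

Raising an error when the iteration count is below the minimum is a design choice of the sketch; it reflects the requirement that such loops not be coded with SPLOOPD.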
Program memory fetch enable is delayed until a specific stage and cycle in the
execution of the epilog. The SPKERNEL instruction fstg and fcyc operands are
combined (by the assembler) to calculate the delay in instruction cycles:
delay = (fstg * ii) + fcyc
Program memory fetch is delayed until the following conditions are all true:
• The loop has reached the last kernel stage boundary
• The loop is not interrupt draining
• The draining counter has reached the delay value specified by fstg and fcyc.
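The delay formula and the three conditions above can be sketched as follows. This is a minimal Python model under stated assumptions: the function and parameter names are invented for illustration, and the operand values in the usage line are hypothetical rather than taken from any example in this guide.

```python
def fetch_enable_delay(fstg, fcyc, ii):
    """Delay in instruction cycles: (fstg * ii) + fcyc."""
    if fstg < 0 or fcyc < 0 or ii < 1:
        raise ValueError("fstg and fcyc must be non-negative and ii at least 1")
    return fstg * ii + fcyc


def program_fetch_enabled(reached_last_kernel_stage_boundary,
                          interrupt_draining, draining_counter, delay):
    """True only when all three conditions in the list above hold."""
    return (reached_last_kernel_stage_boundary
            and not interrupt_draining
            and draining_counter >= delay)


delay = fetch_enable_delay(fstg=2, fcyc=1, ii=3)  # hypothetical operands
print(delay)
```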
Referring back to Example 8-4 on page 8-9, the program memory fetch delay is set to
start fetching after the last epilog instruction.
If the loop buffer goes to idle (for example, if the epilog is smaller than the specified
delay or if the loop follows the early-exit execution pattern), program memory fetch is
enabled and the fetch enable delay is ignored.
If the loop is reloading and the loop executes 0 or 1 iteration, then the loop buffer
executes until the last reloading stage boundary. Between when the reloading counter
becomes equal to the dynlen and the last reloading stage boundary, the loop buffer
issues NOP instructions.
The reload does not have to start on a stage boundary of the draining loop as indicated
by the second and third conditions above.
If the initial termination condition is false, then the value stored in RILC is extracted,
decremented and copied into ILC and normal reloading begins. The value of RILC is
unchanged.
If RILC is equal to 0 on the cycle before the reload begins, the initial termination
condition is true for the reloaded loop. If the initial termination condition is true, then
the reloaded loop invocation is skipped: the instructions in the loop buffer execute as
NOPs until the last reloading stage boundary and the reload condition is evaluated
again.
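The reload bookkeeping described above can be sketched as follows. This is an illustrative Python model, not a description of the hardware implementation. The function name `start_reload` and its return tuple are assumptions made for the sketch, and leaving ILC untouched when the reload is skipped is likewise an assumption, since the text does not say what ILC holds in that case.

```python
def start_reload(ilc, rilc):
    """Return (reload_runs, new_ilc, new_rilc) at the start of a reload.

    The initial termination condition is true when RILC is 0, in which
    case the reloaded loop invocation is skipped.
    """
    if rilc == 0:
        # Skipped invocation: the buffer issues NOPs until the last
        # reloading stage boundary. ILC is left as-is (assumption).
        return False, ilc, rilc
    # RILC is read, decremented, and the result copied into ILC.
    # RILC itself is unchanged.
    return True, rilc - 1, rilc


print(start_reload(ilc=0, rilc=32))
```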
The PC remains at its current location when program memory fetch is disabled. If a
branch disabled program memory fetch, then the PC remains at the branch target
address.
Note that the first condition above is the only time that the loop buffer will not go to
idle after the last delay slot of a taken branch.
There must be at least one valid outer loop branch that will always execute with a true
condition when the loop is reloading. An outer loop branch is valid under all of the
following conditions:
• The branch always executes if the reload condition was true.
• The branch target is the execute packet after the SPKERNEL execute packet.
• The last delay slot of the branch occurs before the reloading counter equals the
dynlen. Note that this restriction implies a minimum for dynlen of 6 cycles.
There may be one or more valid post loop branch instructions that will always execute
with a false condition when the loop is reloading, and that may execute with a true
condition when the loop is not reloading.
Example 8-11 is a nested loop using the reload condition. Figure 8-8 shows the
instruction execution flow for an invocation of the inner loop, the outer loop code, and
then another inner loop. Notice that the reload starts after the first epilog stage of the
inner loop as specified by the SPMASKR instruction in the last cycle of that stage.
;*------------------------------
;* for (j=0; j<32; j++)
;*   for (i=0; i<32; i++)
;*     y[j] += x[i+j] * h[i]
;*------------------------------
;* x=a4, h=b4, y=a6
MVK .S2 32,B0
MVC .S2 B0,ILC ;Inner loop count
NOP 3
[B0] SPLOOP 2
|| MVC .S2 B0,RILC ;Reload inner loop count
|| SUB .D2 B0,1,B0 ;Outer loop count
|| MVK .S1 62,A5 ;X delta
|| MV .L2 B4,B5 ;Copy h
If an SPLOOP (not SPLOOPD or SPLOOPW) instruction is used, then ILC is used for
the 3 cycles before the SPLOOP instruction and until the loop buffer is draining and
not reloading.
If an SPLOOPD instruction is used, then ILC is used on the first cycle after the
SPLOOPD instruction and until the loop buffer is draining and not reloading.
In general, it is an error to read or write ILC or RILC while the loop buffer is using them.
This error is enforced by the following hardware and assembler exceptions. The value
obtained by reading ILC during loading is not assured to be consistent across different
implementations, due to potential differences in timing of the decrement of the register
by the loop hardware.
8.10 Loop Buffer Control Using the SPLOOPW Instruction
The SPLOOPW instruction is intended to be used for do-while loops. These are loops
whose termination condition is more complex than a simple countdown by 1. In
addition, these types of loops compute the loop termination condition and exit without
executing an epilog. This technique may require over-executing (or speculating) some
instructions.
The instruction that defines the termination condition register must occur at least 4
cycles before a stage boundary and at least 4 cycles before the last instruction in the
loop. If the termination condition is determined in the last loading stage, the dynlen
must be a multiple of ii. These restrictions ensure that on return from an interrupt to a
SPLOOPW instruction, the loop executes 1 or more iterations.
Example 8-12 shows a loop with a loop counter that down counts by an unknown
value. For this loop, it must be safe to over-execute the LDH instructions 8 times.
Example 8-13 shows a string copy implementation. Figure 8-9 shows the execution
flow if the source points to a null string. In this version, it must be safe to over-execute
the LDB instruction 4 times.
;*------------------------------
;* do {
;*   sum += *x++ * *y++;
;*   n -= m;
;* } while (n >= 0)
;*------------------------------
[!A1]   SPLOOPW 1
||      MVK     .S1     0x0,A1          ;c = false
        LDH     .D1T1   *A5++,A3        ;t1 = *x++
||      LDH     .D2T2   *B5++,B6        ;t2 = *y++
        NOP     2
        SUB     .L2     B4,B7,B4        ;n -= m
        CMPLT   .L2     B4,0,A1         ;c = n < 0 // term_cond = !A1
        MPY     .M1X    B6,A3,A4        ;p = t1 * t2 // delay slot 1
        NOP     1                       ;// delay slot 2
        ADD     .L1     A4,A6,A6        ;sum += p; // delay slot 3
        SPKERNEL                        ;if (c) break; // cycle term_cond used
8.11 Using the SPMASK Instruction
If the termination condition becomes true while interrupt draining, the action of
interrupt draining results in the under-execution of the early stages of the loop body in
comparison to the same loop when not interrupted. The loop body must be coded such
that the under-execution of the early stages of the loop body is safe.
The initial setup, the post loop operations, and adjusting the setup for the reloaded loop
are all overhead that may be minimized by moving their execution to within the same
instruction cycles as the operation of the SPLOOP.
If some setup code is required to do some initialization that is not used until late in the
loop, you can save instruction cycles by using the SPMASK instruction to overlay the
setup code with the first few cycles of the SPLOOP. The SPMASK will cause the masked
instructions to be executed once without being loaded to the SPLOOP buffer.
Example 8-14 shows how this might be done.
If the SPMASK is used in the outer loop code (that is, post epilog code), it will force the
substitution of the SPMASKed instructions in the outer loop code for the instruction
using the same functional unit in the SPLOOP buffer for the first iteration of the
reloaded inner loop. For example, if pointers need to be reset at the point that a loop is
reloaded, the instructions that do the reset can be inhibited using the SPMASK
instruction so that the instructions that originally adjusted the pointers are replaced in
the execution flow with instructions in the outer loop that are marked with the
SPMASK instruction. Example 8-15 shows how this might be done.
Example 8-14 Using the SPMASK Instruction to Merge Setup Code with SPLOOPW
;*------------------------------
; dst=&(dst[n])
;* do {
; t = *src++;
; *dst++ = t;
; } while (count--)
;
;A4 = Source address
;B4 = Destination address
;A6 = Number of words to copy
;B6 = Offset into destination to do copy
;*------------------------------
[A1] SPLOOPW 1
|| ADD .L1 A6,1,A1 ;Position loop cnt to valid reg
|| SHL .S2 B6,2,B6 ;Adjust offset for size of WORD
SPMASK
||^ ADD .L2 B6,B4,B4 ;Add offset into buffer to dest
|| LDW .D1 *A4++,A0 ;Load word and inc ptr
NOP 1 ;Wait for portion of delay
[A1] SUB .S1 A1,1,A1 ;Decrement loop count
NOP 2 ;Complete necessary wait
MV .L2X A0,B0 ;Position Word for write
SPKERNEL 0,0
|| STW .D2 B0,*B4++ ;Store word
Table 8-5 SPLOOP Instruction Flow for First Three Cycles of Example 8-14
Cycle  Loop 1  Loop 2  Loop 3  Notes
0      ADD                     Instructions are in parallel with the SPLOOP, so
       SHL                     they execute only once.
1      ADD                     The ADD is SPMASKed so it executes only once. The
       LDW                     LDW is loaded to the SPLOOP buffer.
2      NOP     LDW             The ADD was not added to the SPLOOP buffer in
                               cycle 2, so it is not executed here.
3      SUB     NOP     LDW     The SUB is a conditional instruction and may not execute.
4      NOP     SUB     NOP     The SUB is a conditional instruction and may not execute.
5      NOP     NOP     SUB     The SUB is a conditional instruction and may not execute.
6      MV      NOP     NOP
7      STW     MV      NOP
8              STW     MV
9                      STW
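The overlap shown in Table 8-5 follows directly from issuing the same loop body once every ii cycles. The Python sketch below regenerates rows 1 through 9 of the table for the first three loops. It is an illustrative model only: the names `BODY`, `II`, `FIRST_CYCLE`, and `instruction_flow` are invented here, the body is read from the SPLOOPW listing in Example 8-14 with `NOP 2` expanded into two single-cycle NOPs, and the one-time ADD and SHL setup instructions are deliberately left out.

```python
# Loop body held in the buffer, one entry per cycle of a single iteration.
BODY = ["LDW", "NOP", "SUB", "NOP", "NOP", "MV", "STW"]
II = 1           # iteration interval, from SPLOOPW 1
FIRST_CYCLE = 1  # the LDW of loop 1 issues in cycle 1 of the table


def instruction_flow(loops):
    """Return {cycle: [instruction or '' per loop]} for the given loop count."""
    last_cycle = FIRST_CYCLE + (loops - 1) * II + len(BODY) - 1
    flow = {}
    for cycle in range(FIRST_CYCLE, last_cycle + 1):
        row = []
        for loop in range(loops):
            # Each loop starts II cycles after the previous one.
            index = cycle - FIRST_CYCLE - loop * II
            row.append(BODY[index] if 0 <= index < len(BODY) else "")
        flow[cycle] = row
    return flow


for cycle, row in instruction_flow(3).items():
    print(cycle, " ".join(f"{name:<4}" for name in row).rstrip())
```

Restricting the model to three loops mirrors the table; in the running loop, further iterations would continue to enter the pipeline every cycle.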
8.11.2 Some Points About the SPMASK to Merge Setup Code Example
Note the following points about the execution of Example 8-14:
• The ADD and SHL instructions in the same execute packet as the SPLOOPW
instruction are only executed once. They are not loaded to the SPLOOP buffer.
• Because of the SPMASK instruction in the execute packet, the ADD in the same
execute packet as the SPMASK instruction is executed only once and is not
loaded to the SPLOOP buffer. Without the SPMASK, the ADD would conflict
with the MV instruction.
• The SHL and the 2nd ADD instructions could have been placed before the start
of the SPLOOP, but by placing the SHL in parallel with the SPLOOP instruction
and by using the SPMASK to restrict the ADD to a single execution, you have
saved a couple of instruction cycles.
Example 8-15 Using the SPMASK Instruction to Merge Reset Code with SPLOOP
;*------------------------------
; dst=&(dst[n])
;* do {
; t = *src++;
; *dst++ = t;
; } while (count--)
; adjust buffer pointers
;* do {
; t = *src++;
; *dst++ = t;
; } while (count--)
;
;A4 = 1st source address
;B4 = 1st destination address
;A6 = 2nd source address
;B6 = 2nd destination address
;A8 = number of locations to copy from each buffer
;*------------------------------
MVC A8,ILC ;Setup number of loops
MVC A8,RILC ;Reload count
MVK 1,A1 ;Reload flag
NOP 3 ;Wait for ILC load to complete
[A1] SPLOOP 1 ;Start SPLOOP with ii=1
LDW .D1 *A4++,A0 ;Load value from buffer
NOP 4 ;Wait for it to arrive
MV .L2X A0,B0 ;Move it to other side for xfer
SPKERNELR ;End of SPLOOP, immediate reload
|| STW .D2 B0,*B4++ ;...and store value to buffer
BR_TARGET:
SPMASKD1 ;Mask LDW instruction
|| [A1] B BR_TARGET ;Branch to start if post-epilog
|| [A1] SUB .S1 A1, 1, A1 ;Adjust reload flag
|| [A1] LDW .D1 *A6,A0 ;Load first word of 2nd buffer
|| [A1] ADD .L1 A6,4,A4 ;Select new source buffer
NOP 4 ;Keep in sync with SPLOOP body
OR .S2 B6,0,B4 ;Adjust destination to 2nd buffer
NOP
8.11.4 Some Points About the SPMASK to Merge Reset Code Example
Note the following points about the execution of Example 8-15 (see Table 8-5 for the
instruction flow):
• The loop begins reloading from the SPLOOP buffer immediately after the
SPKERNELR instruction with no delay. In Table 8-5, the SPKERNELR is in
cycle 7 and the reload happens in cycle 8.
• Because of the SPMASK instruction, the LDW instruction in the post epilog code
replaces the LDW instruction within the loop, so that the first word copied in the
reloaded loop is from the new input buffer. The ADD instruction is used to adjust
the source buffer address for subsequent iterations within the SPLOOP body. In
Table 8-5, this happens in loop 8. Note that the D1 operand in the SPMASK
instruction indicates that the SPMASK applies to the .D1 unit. This could have
been indicated by marking the LDW instruction with a caret (^) instead.
• The OR instruction is used to adjust the destination address. It occupies the same
position in the post-epilog code that the MV instruction occupies within the
SPLOOP body, so that it does not corrupt the data from the STW instructions
within the SPLOOP epilog still executing from before the reload. In Table 8-5,
this happens in cycle 13 (loop 8).
• The B instruction is used to reset the program counter to the start of the epilog
between executions of the inner loop.
8.11.5 Returning from an Interrupt
When an SPLOOP is piping up after returning from an interrupt, the SPMASKed
instructions coming from the buffer are executed and instructions coming from
program memory are not executed.
8.12 Program Memory Fetch Control
When the loop buffer is active and under certain conditions as described below,
instruction execution from program memory is suspended. When this occurs,
instructions are only fetched and executed from the loop buffer and the PC is
unchanged.
If program memory fetch is disabled on the last loading or reloading stage boundary,
the stage boundary termination condition is true, and the program memory fetch
enable delay has completed, then program memory fetch is not disabled.
Program memory fetch remains disabled while interrupt draining or until a specific
stage and cycle during noninterrupt draining as determined by the program fetch
enable delay operand of the SPKERNEL instruction.
8.13 Interrupts
When an SPLOOP(D/W) instruction is encountered, the address of the execute packet
containing the SPLOOP(D/W) instruction is recorded. If the loop buffer is
interrupted, the address stored in the interrupt return pointer register (IRP) is the
address of the execute packet containing the SPLOOP(D/W) instruction.
When the loop is finished draining and all pending register writes are complete, the
interrupt is taken. This means that the interrupt latency has increased by the number
of instruction cycles in the epilog compared to the non-SPLOOP case.
The above conditions mean that SPLOOP loops starting initial execution or starting reload
with ILC <= (ceil(dynlen / ii) + 3) are not interruptible because there are not enough
kernel stages to allow an interrupt to be taken without violating the last requirement.
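The threshold in the preceding sentence can be checked with a one-line calculation. The following Python is a minimal sketch of that inequality only; the function name is invented for illustration and the sample values of dynlen and ii are hypothetical.

```python
def too_short_to_interrupt(ilc, dynlen, ii):
    """True when ILC <= ceil(dynlen / ii) + 3 at initial execution or reload."""
    if dynlen < 1 or ii < 1:
        raise ValueError("dynlen and ii must be at least 1")
    stages = -(-dynlen // ii)  # integer ceiling of dynlen / ii
    return ilc <= stages + 3


# Hypothetical loop: dynlen = 7, ii = 2, so the threshold is ceil(7/2) + 3 = 7
print(too_short_to_interrupt(7, dynlen=7, ii=2))
print(too_short_to_interrupt(8, dynlen=7, ii=2))
```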
Program memory fetch is disabled when interrupt draining. When the draining is
finished, the address of the execute packet that contains the SPLOOP instruction is
stored in IRP or NRP, and TSR is copied to ITSR or NTSR. The SPLX bit in TSR is
cleared to 0. The SPLX bit in ITSR or NTSR is set to 1.
Interrupt service routines must save and restore the ITSR or NTSR, ILC, and RILC
registers. A B IRP instruction copies ITSR to TSR, and a B NRP restores TSR from
NTSR. The value of the SPLX bit in ITSR or NTSR when the return branch is executed
is used to alter the behavior of SPLOOP(D/W) when it is restarted upon returning
from the interrupt.
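The register traffic described in the two paragraphs above can be sketched as a small state model. This Python is illustrative only: the dictionary layout, function names, and the address 0x1000 are assumptions for the sketch, only the SPLX field is modeled, and other TSR changes made when an interrupt is taken are not represented.

```python
def take_interrupt_after_drain(cpu, sploop_packet_address):
    """Apply the register effects listed in the text once draining finishes."""
    cpu["IRP"] = sploop_packet_address      # return to the SPLOOP execute packet
    cpu["ITSR"] = dict(cpu["TSR"], SPLX=1)  # TSR copied, SPLX set to 1 in ITSR
    cpu["TSR"]["SPLX"] = 0                  # SPLX cleared to 0 in TSR


def b_irp(cpu):
    """Model of B IRP: ITSR is copied to TSR and execution resumes at IRP."""
    cpu["TSR"] = dict(cpu["ITSR"])
    return cpu["IRP"]


cpu = {"TSR": {"SPLX": 1}, "ITSR": {}, "IRP": 0}  # example starting state
take_interrupt_after_drain(cpu, 0x1000)            # 0x1000 is a made-up address
resume_at = b_irp(cpu)
print(hex(resume_at), cpu["TSR"]["SPLX"])
```

Because the return branch restores SPLX = 1 into TSR, the restarted SPLOOP(D/W) instruction can tell that it is resuming an interrupted loop, which is the behavior change the text refers to.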
8.13.3 Exceptions
If an internal or external exception occurs while the loop buffer is active, then the
following occur:
• The exception is recognized immediately and the loop buffer becomes idle.
• The loop buffer does not execute an epilog to drain the currently executing loop.
• TSR is copied into NTSR with the SPLX bit set to 1 in NTSR and cleared to 0 in
TSR.
8.14 Branch Instructions
If a branch is taken and the loop buffer is not reloading, the loop buffer becomes idle,
and execution continues from the branch target address.
If a branch executes with a false condition (the branch is not taken), the execution of
the SPLOOP(D/W) instruction is unaffected by the presence of the untaken branch
except that interrupts are blocked during the delay slots of the branch.
This behavior allows the code in Example 8-16 to run as you expect, branching around
the loop if the condition is false before beginning.
If a branch is taken anytime while the loop buffer is active, except when in reloading,
the loop buffer goes to idle, and execution continues from the branch target address. If
a branch is taken while reloading, the PC is assigned the branch target and program
memory fetch is disabled.
[!A0]   B       around
||      MVC     A0,ILC
        NOP     3
        SPLOOP  ii
; loop body
        . . .
; end of loop body
around:
; code following loop
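The branch rules stated in this section can be summarized as a small decision function. The Python below is a sketch for illustration: the function name, dictionary keys, and string labels are invented here, and the model covers only the three cases the text describes for a branch executed while the loop buffer is active.

```python
def branch_effect(taken, reloading):
    """Summarize the effect of a branch while the loop buffer is active."""
    if not taken:
        # Untaken branch: the loop is unaffected, but interrupts are
        # blocked during the delay slots of the branch.
        return {"loop_buffer": "unaffected",
                "pc": "unchanged",
                "interrupts_blocked_in_delay_slots": True}
    if reloading:
        # Taken while reloading: PC gets the target, fetch stays disabled.
        return {"loop_buffer": "active",
                "pc": "branch target",
                "program_memory_fetch": "disabled"}
    # Taken in any other active state: buffer goes idle, execution
    # continues from the branch target.
    return {"loop_buffer": "idle",
            "pc": "branch target",
            "program_memory_fetch": "enabled"}


print(branch_effect(taken=True, reloading=False))
```

In Example 8-16, the `[!A0] B around` branch corresponds to the last case when A0 is zero: the buffer goes idle and execution continues at `around`.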
8.15 Instruction Resource Conflicts and SPMASK Operation
In the case of any conflict, an SPMASK(R) instruction must be present specifying all
units having conflicts. SPMASKed units for that cycle:
• Disable execution of any loop buffer instructions: BP, BE, or both.
• Execute a PM instruction, if present, with no effect on the buffer contents.
The only special behavior is in the case of restarting SPLOOP(D/W) after return from
interrupt. In this case, during loading SPMASKed units:
• Do not disable execution of any loop buffer instructions.
• Do not execute a present PM instruction.
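The two lists above describe opposite behavior for an SPMASKed unit in the normal case and in the restart-after-interrupt case. A minimal Python sketch of that switch follows; the function name and dictionary keys are invented for illustration, and "PM instruction" is taken to mean the instruction arriving from program memory for that unit.

```python
def spmasked_unit_behavior(pm_instruction_present, restarting_after_interrupt):
    """Behavior of one SPMASKed functional unit for one cycle."""
    if restarting_after_interrupt:
        # Special case during loading after return from interrupt:
        # the buffer instruction runs and the PM instruction does not.
        return {"buffer_instructions_disabled": False,
                "pm_instruction_executes": False}
    # Normal case: buffer instructions on this unit are disabled and
    # the PM instruction, if present, executes without changing the buffer.
    return {"buffer_instructions_disabled": True,
            "pm_instruction_executes": pm_instruction_present}


print(spmasked_unit_behavior(True, restarting_after_interrupt=False))
print(spmasked_unit_behavior(True, restarting_after_interrupt=True))
```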
If an SPMASK instruction is encountered when the loop buffer is idle or not loading
or not draining, the SPMASK instruction executes as a NOP.
Stall detection is one critical speed path in the CPU design. Adding to that path for the
case where instructions are coming from the loop buffer is undesirable and
unnecessary. There are no compelling cases where you would want to schedule a stall
within the loop body. In fact, the compiler works to ensure this does not happen. For
these reasons, the CPU will not stall for instructions coming from the loop buffer that
read/use values written on the previous cycle that require a stall for correct behavior.
8.16 Restrictions on Crosspath Stalls
In the event that a case occurs where a stall is required for correct operation but did not
occur, an internal exception is generated. This internal exception sets the LBX and
MSX bits in the internal exception report register (IERR), indicating a missed stall with
loop buffer operation. The exception is only generated in the event that the stall is
actually required.
There is one special case that causes an unnecessary stall in normal operation and can
be generated by the compiler. It is the case where the two instructions involved in the
stall detection are predicated on opposite conditions. This means only one of the
instructions actually executes and a stall was not required for correct behavior. Since
the stall detection is earlier in the pipeline, the decision to stall must be made before it
is known whether the instructions execute. Thus a stall is caused, even though it later
turns out not to be needed. In this case, the lack of detection for the instruction coming
from the loop buffer does not cause incorrect behavior. This allows the compiler to
continue to generate code using this case that can result in improved scheduling and
performance. The internal exception is not generated in this case.
It is possible for the assembly language programmer to place an instruction in the delay
slots of a branch to an SPLOOP that causes a pipelined write to happen while the loop
buffer is active. It is also possible for the assembly language programmer to predicate
the write and reads with different predicate values that are not mutually exclusive. The
assembler cannot prevent these cases from occurring; if they do occur, the internal
exception is generated.
The NOP, NOP n, and BNOP instructions are the only unitless instructions allowed to
be used in an SPLOOP(D/W) body. The assembler disallows the use of any other
unitless instruction in the loop body.
8.18 Restrictions on Instructions Placed in the Loop Buffer
Chapter 9
CPU Privilege
9.1 Overview
The CPU includes support for a form of protected-mode operation with a two-level
system of privileged program execution.
9.2 Execution Modes
All bits in these registers can be read in User mode; however, only certain bits in these
registers can be written while in User mode. Writes to these restricted bits have no
effect. Since access to some bits is allowed, there is no exception caused by access to
these registers.
9.3 Interrupts and Exception Handling
The interrupt handler begins executing at the address formed by adding the offset for
the particular interrupt event to the value of the interrupt service table pointer register
(ISTP). The return from interrupt (B IRP) instruction restores the saved values from
ITSR into TSR, causing execution to resume in the execution mode of the interrupted
program.
The transition to the restored execution mode is coincident with the execution of the
return branch target. Instructions in the delay slots of the branch execute in
Supervisor mode.
The exception handler begins executing at the address formed by adding the offset for
the exception/NMI event to the value of the interrupt service table pointer register
(ISTP). The return from exception (B NRP) instruction restores the saved values from
NTSR into TSR.
9.4 Operating System Entry
There is one potential problem with allowing direct calling into the operating system:
the caller can choose where to enter the OS and, if allowed to choose any OS location to
enter, can:
• Bypass operand checking by OS routines
• Access undocumented interfaces
• Defeat protection
• Corrupt OS data structures by bypassing consistency checks or locking
In short, allowing unrestricted entry into an OS is a very bad idea. Instead, you need to
provide a tightly controlled way of entering the operating system and switching from User
mode to Supervisor mode. The mechanism chosen is essentially an exception, where
the handler decodes the requested operation and dispatches to a Supervisor mode
routine that validates the arguments and services the request.
When returning from an interrupt or exception, the IRP or NRP should already have
the correct return address and the ITSR.CXM or NTSR.CXM bit should already be set
to 1.
When spawning a user mode task, the appropriate CXM bit and the IRP or NRP will
need to be initialized explicitly with the entry point address of the User mode task. In
addition, the restricted entry point address register (REP) should be loaded with the
desired return address that the User mode task will use when it terminates.
The User mode task can force a change to Supervisor mode by forcing an exception by
executing either an SWE or SWENR instruction.
The SWE and SWENR instructions both force a software exception. The SWE
instruction is used when a return from the exception back to the point of the exception
is desired. The SWENR instruction is used when a return to the User mode routine is
not desired.
Execution of an SWE instruction results in an exception being taken before the next
execute packet is processed. The return pointer stored in the nonmaskable interrupt
return pointer register (NRP) points to this unprocessed packet. The value of the task
state register (TSR) is copied to the NMI/exception task state register (NTSR) at the end
of the cycle containing the SWE instruction, and the interrupt/exception default value
is written to TSR. The SWE instruction should not be placed in the delay slots of a
branch since all instructions behind the SWE instruction in the pipe are annulled. All
writes to registers in the pipe from instructions executed before and in parallel with the
SWE instruction will complete before execution of the exception service routine;
therefore, the instructions prior to the SWE will complete (along with all their delay
slots) before the instructions after the SWE.
If the SWE instruction is executed while in User mode, the mode is changed to
Supervisor mode as part of the exception servicing process. The TSR is copied to NTSR,
the return address is placed in the NRP register, and a transfer of control is forced to
the NMI/Exception vector pointed to by the current value of the ISTP. Any code necessary
to interpret a User mode request should reside in the exception service routine. After
processing the request, the exception handler returns control to the user task by
executing a B NRP instruction.
The SWENR instruction can also be used to terminate a user mode task. The SWENR
instruction is similar to the SWE instruction except that no provision is made for
returning to the user mode task and the transfer of control is to the address pointed to
by REP instead of the NMI/exception vector. Supervisor mode code should have placed
the correct address in REP beforehand.
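The control transfers for SWE, B NRP, and SWENR can be sketched as a small state model. This Python is illustrative and rests on several assumptions that are not stated in this section: the CXM encoding (0 for Supervisor, 1 for User) is assumed, the vector offset 0x20 and all addresses are made up, only the CXM field of TSR is modeled, and the sketch leaves NRP and NTSR untouched on SWENR because the text describes no return path; the hardware's treatment of those registers in that case is not modeled.

```python
SUPERVISOR, USER = 0, 1  # assumed CXM encoding for this sketch


def swe(cpu, next_packet_address, exception_vector_offset):
    """SWE: save state, switch to Supervisor mode, jump to the exception vector."""
    cpu["NTSR"] = dict(cpu["TSR"])      # TSR copied to NTSR
    cpu["NRP"] = next_packet_address    # return pointer: the unprocessed packet
    cpu["TSR"]["CXM"] = SUPERVISOR
    cpu["PC"] = cpu["ISTP"] + exception_vector_offset


def b_nrp(cpu):
    """B NRP: restore TSR from NTSR and resume at NRP."""
    cpu["TSR"] = dict(cpu["NTSR"])
    cpu["PC"] = cpu["NRP"]


def swenr(cpu):
    """SWENR: switch to Supervisor mode and jump to the address held in REP."""
    cpu["TSR"]["CXM"] = SUPERVISOR
    cpu["PC"] = cpu["REP"]


cpu = {"TSR": {"CXM": USER}, "NTSR": {}, "NRP": 0,
       "REP": 0x3000, "ISTP": 0x2000, "PC": 0x100}  # made-up values
swe(cpu, next_packet_address=0x104, exception_vector_offset=0x20)
b_nrp(cpu)
print(hex(cpu["PC"]), cpu["TSR"]["CXM"])
```

The contrast between `swe` and `swenr` is the point of the sketch: the first records where to return, the second relies entirely on the address that Supervisor mode code placed in REP.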
Appendix A
Instruction Compatibility
Table A-1 lists the instructions that are common to the C62x, C64x, C64x+, C67x,
C67x+, C674x, and C66x DSPs.
Table A-1 Instruction Compatibility Between C62x, C64x, C64x+, C67x, C67x+, C674x, and C66x DSPs
Instruction C62x DSP C64x DSP C64x+ DSP C67x DSP C67x+ DSP C674x DSP C66x DSP
ABS ✓ ✓ ✓ ✓ ✓ ✓ ✓
ABS2 ✓ ✓ ✓ ✓
ABSDP ✓ ✓ ✓ ✓
ABSSP ✓ ✓ ✓ ✓
ADD ✓ ✓ ✓1 ✓ ✓ ✓ ✓
ADDAB ✓ ✓ ✓ ✓ ✓ ✓ ✓
ADDAD ✓ ✓ ✓ ✓ ✓ ✓
ADDAH ✓ ✓ ✓ ✓ ✓ ✓ ✓
ADDAW ✓ ✓ ✓1 ✓ ✓ ✓ ✓
ADDDP ✓ ✓ ✓ ✓
ADDK ✓ ✓ ✓1 ✓ ✓ ✓ ✓
ADDKPC ✓ ✓ ✓ ✓
ADDSP ✓ ✓ ✓ ✓
ADDSUB ✓ ✓ ✓
ADDSUB2 ✓ ✓ ✓
ADDU ✓ ✓ ✓ ✓ ✓ ✓ ✓
ADD2 ✓ ✓ ✓ ✓ ✓ ✓ ✓
ADD4 ✓ ✓ ✓ ✓
AND ✓ ✓ ✓1 ✓ ✓ ✓ ✓
ANDN ✓ ✓ ✓ ✓
AVG2 ✓ ✓ ✓ ✓
AVGU4 ✓ ✓ ✓ ✓
B displacement ✓ ✓ ✓ ✓ ✓ ✓ ✓
B register ✓ ✓ ✓ ✓ ✓ ✓ ✓
B IRP ✓ ✓ ✓ ✓ ✓ ✓ ✓
B NRP ✓ ✓ ✓ ✓ ✓ ✓ ✓
BDEC ✓ ✓ ✓ ✓ ✓
BITC4 ✓ ✓ ✓ ✓
BITR ✓ ✓ ✓ ✓
BNOP displacement ✓ ✓1 ✓ ✓
BNOP register ✓ ✓ ✓ ✓
BPOS ✓ ✓ ✓ ✓
CALLP ✓1 ✓ ✓
CCMATMPY ✓
CCMATMPYR1 ✓
CCMPY32R1 ✓
CLR ✓ ✓ ✓1 ✓ ✓ ✓ ✓
CMATMPY ✓
CMATMPYR1 ✓
CMPEQ ✓ ✓ ✓1 ✓ ✓ ✓ ✓
CMPEQ2 ✓ ✓ ✓ ✓
CMPEQ4 ✓ ✓ ✓ ✓
CMPEQDP ✓ ✓ ✓ ✓
CMPEQSP ✓ ✓ ✓ ✓
CMPGT ✓ ✓ ✓1 ✓ ✓ ✓ ✓
CMPGT2 ✓ ✓ ✓ ✓
CMPGTDP ✓ ✓ ✓ ✓
CMPGTSP ✓ ✓ ✓ ✓
CMPGTU ✓ ✓ ✓1 ✓ ✓ ✓ ✓
CMPGTU4 ✓ ✓ ✓ ✓
CMPLT ✓ ✓ ✓1 ✓ ✓ ✓ ✓
CMPLT2 ✓ ✓ ✓ ✓
CMPLTDP ✓ ✓ ✓ ✓
CMPLTSP ✓ ✓ ✓ ✓
CMPLTU ✓ ✓ ✓1 ✓ ✓ ✓ ✓
CMPLTU4 ✓ ✓ ✓ ✓
CMPY ✓ ✓ ✓
CMPY32R1 ✓
CMPYR ✓ ✓ ✓
CMPYR1 ✓ ✓ ✓
CMPYSP ✓
CROT270 ✓
CROT90 ✓
DADD ✓
DADD2 ✓
DADDSP ✓
DAPYS2 ✓
DAVG2 ✓
DAVGNR2 ✓
DAVGNRU4 ✓
DAVGU4 ✓
DCCMPY ✓
DCCMPYR1 ✓
DCMPEQ2 ✓
DCMPEQ4 ✓
DCMPGT2 ✓
DCMPGTU4 ✓
DCMPY ✓
DCMPYR1 ✓
DCROT270 ✓
DCROT90 ✓
DDOTP4 ✓ ✓ ✓
DDOTP4H ✓ ✓ ✓
DDOTPH2 ✓ ✓ ✓
DDOTPH2R ✓ ✓ ✓
DDOTPL2 ✓ ✓ ✓
DDOTPL2R ✓ ✓ ✓
DDOTPSU4H ✓ ✓ ✓
DEAL ✓ ✓ ✓ ✓
DINT ✓ ✓ ✓
DINTHSP (16-bit) ✓
DINTHSP (32-bit) ✓
DINTHSPU ✓
DINTSPU
DMAX2 ✓
DMAXU4 ✓
DMIN2 ✓
DMINU4 ✓
DMPY2 ✓
DMPYSP ✓
DMPYSU4 ✓
DMPYU2 ✓
DMPYU4 ✓
DMV ✓ ✓
DMVD ✓
DOTP2 ✓ ✓ ✓ ✓
DOTP4H ✓ ✓ ✓ ✓
DOTPN2 ✓ ✓ ✓ ✓
DOTPNRSU2 ✓ ✓ ✓ ✓
DOTPNRUS2 ✓ ✓ ✓ ✓
DOTPRSU2 ✓ ✓ ✓ ✓
DOTPRUS2 ✓ ✓ ✓ ✓
DOTPSU4 ✓ ✓ ✓ ✓
DOTPSU4H ✓ ✓ ✓ ✓
DOTPUS4 ✓ ✓ ✓ ✓
DOTPU4 ✓ ✓ ✓ ✓
DPACK2 ✓ ✓ ✓
DPACKH2 ✓
DPACKH4 ✓
DPACKHL2 ✓
DPACKL2 ✓
DPACKL4 ✓
DPACKLH2 ✓
DPACKLH4 ✓
DPACKX2 ✓ ✓ ✓
DPINT ✓ ✓ ✓ ✓
DPSP ✓ ✓ ✓ ✓
DPTRUNC ✓ ✓ ✓ ✓
DSADD ✓
DSADD2 ✓
DSHL ✓
DSHL2 ✓
DSHR ✓
DSHR2 ✓
DSHRU ✓
DSHRU2 ✓
DSMPY2 ✓
DSPACKU4 ✓
DSPINT ✓
DSPINTH ✓
DSSUB ✓
DSSUB2 ✓
DSUB ✓
DSUB2 ✓
DSUBSP ✓
DXPND2 ✓
DXPND4 ✓
EXT ✓ ✓ ✓1 ✓ ✓ ✓ ✓
EXTU ✓ ✓ ✓1 ✓ ✓ ✓ ✓
FADDDP ✓
FADDSP ✓
FMPYDP ✓
FSUBDP ✓
FSUBSP ✓
GMPY ✓ ✓ ✓
GMPY4 ✓ ✓ ✓ ✓
IDLE ✓ ✓ ✓ ✓ ✓ ✓ ✓
INTDP ✓ ✓ ✓ ✓
INTDPU ✓ ✓ ✓ ✓
INTSP ✓ ✓ ✓ ✓
Table A-1 Instruction Compatibility Between C62x, C64x, C64x+, C67x, C67x+, and C674x DSPs (Part 5 of 8)
Instruction C62x DSP C64x DSP C64x+ DSP C67x DSP C67x+ DSP C674x DSP C66x DSP
INTSPU ✓ ✓ ✓ ✓
LAND ✓
LANDN ✓
LDB and LDB(U) ✓ ✓ ✓1 ✓ ✓ ✓ ✓
LDB and LDB(U) (15-bit offset) ✓ ✓ ✓1 ✓ ✓ ✓ ✓
LDDW ✓ ✓1 ✓ ✓ ✓ ✓
LDH and LDH(U) ✓ ✓ ✓1 ✓ ✓ ✓ ✓
LDH and LDH(U) (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓ ✓
LDNDW ✓ ✓1 ✓ ✓
LDNW ✓ ✓1 ✓ ✓
LDW ✓ ✓ ✓1 ✓ ✓ ✓ ✓
LDW (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓ ✓
LMBD ✓ ✓ ✓ ✓ ✓ ✓ ✓
LOR ✓
MAX2 ✓ ✓ ✓ ✓
MAXU4 ✓ ✓ ✓ ✓
MFENCE ✓
MIN2 ✓ ✓ ✓ ✓
MINU4 ✓ ✓ ✓ ✓
MPY ✓ ✓ ✓1 ✓ ✓ ✓ ✓
MPY2 ✓ ✓ ✓ ✓
MPY2IR ✓ ✓ ✓
MPY32 (32-bit result) ✓ ✓ ✓
MPY32 (64-bit result) ✓ ✓ ✓
MPY32SU ✓ ✓ ✓
MPY32U ✓ ✓ ✓
MPY32US ✓ ✓ ✓
MPYDP ✓ ✓ ✓ ✓
MPYH ✓ ✓ ✓1 ✓ ✓ ✓ ✓
MPYHI ✓ ✓ ✓ ✓
MPYHIR ✓ ✓ ✓ ✓
MPYHL ✓ ✓ ✓1 ✓ ✓ ✓ ✓
MPYHLU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYHSLU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYHSU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYHU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYHULS ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYHUS ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYI ✓ ✓ ✓ ✓
MPYID ✓ ✓ ✓ ✓
MPYIH ✓ ✓ ✓ ✓
MPYIHR ✓ ✓ ✓ ✓
MPYIL ✓ ✓ ✓ ✓
MPYILR ✓ ✓ ✓ ✓
Table A-1 Instruction Compatibility Between C62x, C64x, C64x+, C67x, C67x+, and C674x DSPs (Part 6 of 8)
Instruction C62x DSP C64x DSP C64x+ DSP C67x DSP C67x+ DSP C674x DSP C66x DSP
MPYLH ✓ ✓ ✓1 ✓ ✓ ✓ ✓
MPYLHU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYLI ✓ ✓ ✓ ✓
MPYLIR ✓ ✓ ✓ ✓
MPYLSHU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYLUHS ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYSP ✓ ✓ ✓ ✓
MPYSPDP ✓ ✓ ✓ ✓
MPYSP2DP ✓ ✓ ✓ ✓
MPYSU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYSU4 ✓ ✓ ✓ ✓
MPYU ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYU2 ✓
MPYU4 ✓ ✓ ✓ ✓
MPYUS ✓ ✓ ✓ ✓ ✓ ✓ ✓
MPYUS4 ✓ ✓ ✓ ✓
MV ✓ ✓ ✓1 ✓ ✓ ✓ ✓
MVC ✓ ✓ ✓1 ✓ ✓ ✓ ✓
MVD ✓ ✓ ✓ ✓
MVK ✓ ✓ ✓1 ✓ ✓ ✓ ✓
MVKH/MVKLH ✓ ✓ ✓ ✓ ✓ ✓ ✓
MVKL ✓ ✓ ✓ ✓ ✓ ✓ ✓
NEG ✓ ✓ ✓1 ✓ ✓ ✓ ✓
NOP ✓ ✓ ✓1 ✓ ✓ ✓ ✓
NORM ✓ ✓ ✓ ✓ ✓ ✓ ✓
NOT ✓ ✓ ✓ ✓ ✓ ✓ ✓
OR ✓ ✓ ✓1 ✓ ✓ ✓ ✓
PACK2 ✓ ✓ ✓ ✓
PACKH2 ✓ ✓ ✓ ✓
PACKH4 ✓ ✓ ✓ ✓
PACKHL2 ✓ ✓ ✓ ✓
PACKLH2 ✓ ✓ ✓ ✓
PACKL4 ✓ ✓ ✓ ✓
QMPY32 ✓
QMPYSP ✓
QSMPY32R1 ✓
RCPDP ✓ ✓ ✓ ✓
RCPSP ✓ ✓ ✓ ✓
RINT ✓ ✓ ✓
ROTL ✓ ✓ ✓ ✓
RPACK2 ✓ ✓ ✓
RSQRDP ✓ ✓ ✓ ✓
RSQRSP ✓ ✓ ✓ ✓
SADD ✓ ✓ ✓1 ✓ ✓ ✓ ✓
Table A-1 Instruction Compatibility Between C62x, C64x, C64x+, C67x, C67x+, and C674x DSPs (Part 7 of 8)
Instruction C62x DSP C64x DSP C64x+ DSP C67x DSP C67x+ DSP C674x DSP C66x DSP
SADD2 ✓ ✓ ✓ ✓
SADDSUB ✓ ✓ ✓
SADDSUB2 ✓ ✓ ✓
SADDSU2 ✓ ✓ ✓ ✓
SADDUS2 ✓ ✓ ✓ ✓
SADDU4 ✓ ✓ ✓ ✓
SAT ✓ ✓ ✓ ✓ ✓ ✓ ✓
SET ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SHFL ✓ ✓ ✓ ✓
SHFL3 ✓ ✓ ✓
SHL ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SHL2 ✓
SHLMB ✓ ✓ ✓ ✓
SHR ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SHR2 ✓ ✓ ✓ ✓
SHRMB ✓ ✓ ✓ ✓
SHRU ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SHRU2 ✓ ✓ ✓ ✓
SMPY ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SMPYH ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SMPYHL ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SMPYLH ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SMPY2 ✓ ✓ ✓ ✓
SMPY32 ✓ ✓ ✓
SPACK2 ✓ ✓ ✓ ✓
SPACKU4 ✓ ✓ ✓ ✓
SPDP ✓ ✓ ✓ ✓
SPINT ✓ ✓ ✓ ✓
SPKERNEL ✓1 ✓ ✓
SPKERNELR ✓ ✓ ✓
SPLOOP ✓1 ✓ ✓
SPLOOPD ✓1 ✓ ✓
SPLOOPW ✓ ✓ ✓
SPMASK ✓1 ✓ ✓
SPMASKR ✓1 ✓ ✓
SPTRUNC ✓ ✓ ✓ ✓
SSHL ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SSHVL ✓ ✓ ✓ ✓
SSHVR ✓ ✓ ✓ ✓
SSUB ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SSUB2 ✓ ✓ ✓
STB ✓ ✓ ✓1 ✓ ✓ ✓ ✓
STB (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓ ✓
STDW ✓ ✓1 ✓ ✓
Table A-1 Instruction Compatibility Between C62x, C64x, C64x+, C67x, C67x+, and C674x DSPs (Part 8 of 8)
Instruction C62x DSP C64x DSP C64x+ DSP C67x DSP C67x+ DSP C674x DSP C66x DSP
STH ✓ ✓ ✓1 ✓ ✓ ✓ ✓
STH (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓ ✓
STNDW ✓ ✓1 ✓ ✓
STNW ✓ ✓1 ✓ ✓
STW ✓ ✓ ✓1 ✓ ✓ ✓ ✓
STW (15-bit offset) ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SUB ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SUBAB ✓ ✓ ✓ ✓ ✓ ✓ ✓
SUBABS4 ✓ ✓ ✓ ✓
SUBAH ✓ ✓ ✓ ✓ ✓ ✓ ✓
SUBAW ✓ ✓ ✓1 ✓ ✓ ✓ ✓
SUBC ✓ ✓ ✓ ✓ ✓ ✓ ✓
SUBDP ✓ ✓ ✓ ✓
SUBSP ✓ ✓ ✓ ✓
SUBU ✓ ✓ ✓ ✓ ✓ ✓ ✓
SUB2 ✓ ✓ ✓ ✓ ✓ ✓ ✓
SUB4 ✓ ✓ ✓ ✓
SWAP2 ✓ ✓ ✓ ✓
SWAP4 ✓ ✓ ✓ ✓
SWE ✓ ✓ ✓
SWENR ✓ ✓ ✓
UNPKBU4 ✓
UNPKH2 ✓
UNPKHU2 ✓
UNPKHU4 ✓ ✓ ✓ ✓
UNPKLU4 ✓ ✓ ✓ ✓
XOR ✓ ✓ ✓1 ✓ ✓ ✓ ✓
XORMPY ✓ ✓ ✓
XPND2 ✓ ✓ ✓ ✓
XPND4 ✓ ✓ ✓ ✓
ZERO ✓ ✓ ✓ ✓ ✓ ✓ ✓
1. Instruction also available in compact form. See Section 3.10 on page 3-29.
Appendix B
Appendix B—Mapping Between Instruction and Functional Unit www.ti.com
DMAX2 ✓
DMAXU4 ✓
DMIN2 ✓
DMINU4 ✓
DMPY2 ✓
DMPYSP ✓
DMPYSU4 ✓
DMPYU2 ✓
DMPYU4 ✓
DMV ✓ ✓
DMVD ✓ ✓
DOTP2 ✓
DOTP4H ✓
DOTPN2 ✓
DOTPNRSU2 ✓
DOTPNRUS2 ✓
DOTPRSU2 ✓
DOTPRUS2 ✓
DOTPSU4 ✓
INTDPU ✓
INTSP ✓ ✓
INTSPU ✓ ✓
LAND ✓
LANDN ✓
Appendix C
This appendix lists the instructions that execute in the .D functional unit and illustrates
the opcode maps for these instructions.
C.2 Opcode Map Symbols and Meanings
Appendix C—.D Unit Instructions and Opcode Maps www.ti.com
x cross path for src2; 0 = do not use cross path, 1 = use cross path
y .D1 or .D2 unit; 0 = .D1 unit, 1 = .D2 unit
z test for equality with zero or nonzero
C.3 32-Bit Opcode Maps
13 12 7 6 5 4 3 2 1 0
src1 op 1 0 0 0 0 s p
5 6 1 1
3 1 5 5 5
13 12 11 10 9 6 5 4 3 2 1 0
src1 x 1 0 op 1 1 0 0 s p
5 1 4 1 1
3 1 5 5 5
13 12 9 8 7 6 4 3 2 1 0
offsetR mode r y op 0 1 s p
5 4 1 1 3 1 1
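The first 32-bit .D unit layout above (src1, a 6-bit op field, the fixed pattern 1 0 0 0 0, then s and p) can be unpacked with shifts and masks. The following is a minimal sketch, not TI-supplied code. The function name is hypothetical, and the upper fields (creg, z, dst, src2) are an assumption inferred from the 3/1/5/5/5 width row, because the upper bit-number row did not survive in this copy of the figure.

```python
def decode_d_unit_fields(word):
    """Split a 32-bit opcode word into the fields of the first .D layout.

    Low-order fields follow the bit-number row "13 12 7 6 5 4 3 2 1 0".
    The upper fields are ASSUMED from the width row "3 1 5 5 5".
    """
    word &= 0xFFFFFFFF
    return {
        "creg": (word >> 29) & 0x7,    # assumed: 3-bit condition register field
        "z": (word >> 28) & 0x1,       # assumed: zero/nonzero test bit
        "dst": (word >> 23) & 0x1F,    # assumed: 5-bit destination
        "src2": (word >> 18) & 0x1F,   # assumed: 5-bit source 2
        "src1": (word >> 13) & 0x1F,   # 5-bit field ending at bit 13
        "op": (word >> 7) & 0x3F,      # 6-bit opfield, bits 12-7
        "fixed": (word >> 2) & 0x1F,   # bits 6-2, expected 1 0 0 0 0
        "s": (word >> 1) & 0x1,        # side select
        "p": word & 0x1,               # parallel-execution bit
    }


def has_d_unit_pattern(word):
    """Return True when bits 6-2 hold the fixed pattern 1 0 0 0 0."""
    return ((word >> 2) & 0x1F) == 0b10000
```

A decoder built this way would check the fixed pattern first and only then interpret op, since the other layouts on this page reuse the same bit positions for different fields.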
C.4 16-Bit Opcode Maps
3 1 5 5 5
13 12 9 8 7 6 5 4 3 2 1 0
offsetR mode 1 y 0 1 0 0 1 s p
5 4 1 1 1
3 1 5 5 5
13 12 9 8 7 6 5 4 3 2 1 0
offsetR mode 1 y 1 1 1 0 1 s p
5 4 1 1 1
8 7 6 4 3 2 1 0
offsetR y op 1 1 s p
15 1 3 1 1
0 0 0 1 dst offsetR
5 15
8 7 6 4 3 2 1 0
offsetR y op 1 1 s p
15 1 3 1 1
ld/st Mnemonic
0 STW (.unit) src,*B15[ucst5]
1 LDW (.unit)*B15[ucst5], dst
op Mnemonic
0 ADD (.unit) src1, src2, dst (src1 = dst)
1 SUB (.unit) src1, src2, dst (src1 = dst, dst = src1 - src2)
Mnemonic
ADDAW (.unit)B15, ucst5, dst
op Mnemonic
0 ADDAW (.unit)B15, ucst5, B15
1 SUBAW (.unit)B15, ucst5, B15
op Mnemonic
0 0 0 see LSDx1, Figure G-4
0 0 1 see LSDx1, Figure G-4
0 1 0 Reserved
0 1 1 SUB (.unit) src2, 1, dst (src2 = dst, dst = src2 - 1)
1 0 0 Reserved
1 0 1 see LSDx1, Figure G-4
1 1 0 Reserved
1 1 1 see LSDx1, Figure G-4
dw ld/st Mnemonic
0 0 STW (.unit) src,*B15--[ucst2]
0 1 LDW (.unit)*++B15[ucst2], dst
1 0 STDW (.unit) src,*B15--[ucst2]
1 1 LDDW (.unit)*++B15[ucst2], dst
Appendix D
This appendix lists the instructions that execute in the .L functional unit and illustrates
the opcode maps for these instructions.
D.3 32-Bit Opcode Maps
Appendix D—.L Unit Instructions and Opcode Maps www.ti.com
sn sign
src1 source 1
src2 source 2
ucstn n-bit unsigned constant field
x cross path for src2; 0 = do not use cross path, 1 = use cross path
z test for equality with zero or nonzero
3 1 5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
5 5 5
13 12 11 5 4 3 2 1 0
src1 x op 1 1 0 s p
5 1 7 1 1
D.4 16-Bit Opcode Maps
3 1 5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
op x 0 0 1 1 0 1 0 1 1 0 s p
5 1 1 1
op SAT Mnemonic
0 0 ADD (.unit) src1, src2, dst
0 1 SADD (.unit) src1, src2, dst
1 0 SUB (.unit) src1, src2, dst (dst = src1 - src2)
1 1 SSUB (.unit) src1, src2, dst (dst = src1 - src2)
Mnemonic
ADD (.unit) scst5, src2, dst
op Mnemonic
0 0 0 AND (.unit) src1, src2, dst
0 0 1 OR (.unit) src1, src2, dst
0 1 0 XOR (.unit) src1, src2, dst
0 1 1 CMPEQ (.unit) src1, src2, dst
1 0 0 CMPLT (.unit) src1, src2, dst (dst = src1 < src2 , signed compare)
1 0 1 CMPGT (.unit) src1, src2, dst (dst = src1 > src2 , signed compare)
1 1 0 CMPLTU (.unit) src1, src2, dst (dst = src1 < src2 , unsigned compare)
1 1 1 CMPGTU (.unit) src1, src2, dst (dst = src1 > src2 , unsigned compare)
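The difference between the signed and unsigned compares in the table above is easiest to see on a value with bit 31 set. The following is a minimal sketch that models the eight op codes on 32-bit register values. The names `L3_OPS`, `to_signed32`, and `execute_l3` are hypothetical, and the model covers only the arithmetic result, not encoding or timing.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


# op code -> (mnemonic, function of the two 32-bit source patterns)
L3_OPS = {
    0b000: ("AND", lambda a, b: a & b),
    0b001: ("OR", lambda a, b: a | b),
    0b010: ("XOR", lambda a, b: a ^ b),
    0b011: ("CMPEQ", lambda a, b: int(a == b)),
    0b100: ("CMPLT", lambda a, b: int(to_signed32(a) < to_signed32(b))),
    0b101: ("CMPGT", lambda a, b: int(to_signed32(a) > to_signed32(b))),
    0b110: ("CMPLTU", lambda a, b: int(a < b)),
    0b111: ("CMPGTU", lambda a, b: int(a > b)),
}


def execute_l3(op, src1, src2):
    """Return (mnemonic, dst) for a 3-bit op code and two source values."""
    name, fn = L3_OPS[op]
    return name, fn(src1 & MASK32, src2 & MASK32) & MASK32
```

With src1 = 0xFFFFFFFF and src2 = 1, the signed compare treats src1 as -1 and reports "less than", while the unsigned compare treats it as the largest 32-bit value and does not.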
Mnemonic
MVK (.unit) scst5, dst
Mnemonic
CMPEQ (.unit) ucst3, src2, dst
op Mnemonic
0 0 CMPLT (.unit) ucst1, src2, dst (dst = ucst1 < src2 , signed compare)
0 1 CMPGT (.unit) ucst1, src2, dst (dst = ucst1 > src2 , signed compare)
1 0 CMPLTU (.unit) ucst1, src2, dst (dst = ucst1 < src2 , unsigned compare)
1 1 CMPGTU (.unit) ucst1, src2, dst (dst = ucst1 > src2 , unsigned compare)
op Mnemonic
0 0 0 see LSDx1, Figure G-4
0 0 1 see LSDx1, Figure G-4
0 1 0 SUB (.unit)0, src2, dst (src2 = dst; dst = 0 - src2)
0 1 1 ADD (.unit)-1, src2, dst (src2 = dst)
1 0 0 Reserved
1 0 1 see LSDx1, Figure G-4
1 1 0 Reserved
1 1 1 see LSDx1, Figure G-4
Appendix E
This appendix lists the instructions that execute in the .M functional unit and illustrates
the opcode maps for these instructions.
E.2 Opcode Map Symbols and Meanings
Appendix E—.M Unit Instructions and Opcode Maps www.ti.com
13 12 11 10 6 5 4 3 2 1 0
src1 x 0 op 1 1 0 0 s p
5 1 5 1 1
5 5 5
13 12 11 10 6 5 4 3 2 1 0
src1 x 0 op 1 1 0 0 s p
5 1 5 1 1
E.4 16-Bit Opcode Maps
0 0 0 1 dst src2 op
5 5 5
13 12 11 10 9 8 7 6 5 4 3 2 1 0
op x 0 0 0 0 1 1 1 1 0 0 s p
5 1 1 1
SAT op Mnemonic
0 0 0 MPY (.unit) src1, src2, dst
0 0 1 MPYH (.unit) src1, src2, dst
0 1 0 MPYLH (.unit) src1, src2, dst
0 1 1 MPYHL (.unit) src1, src2, dst
1 0 0 SMPY (.unit) src1, src2, dst
1 0 1 SMPYH (.unit) src1, src2, dst
1 1 0 SMPYLH (.unit) src1, src2, dst
1 1 1 SMPYHL (.unit) src1, src2, dst
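The SAT and op fields in the table above select which 16-bit halves are multiplied and whether the product is doubled and saturated. The following is a minimal sketch of that selection, assuming the usual C6000 definitions: MPYLH takes the low half of src1 and the high half of src2, MPYHL the reverse, and the SMPY forms shift the product left by one and replace 0x80000000 with 0x7FFFFFFF. The function names are hypothetical; consult the individual instruction descriptions for the authoritative behavior.

```python
MASK32 = 0xFFFFFFFF


def s16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def low(value):
    return s16(value)


def high(value):
    return s16(value >> 16)


# op code -> (mnemonic suffix, selector for src1, selector for src2)
HALF_SELECT = {
    0b00: ("", low, low),      # MPY / SMPY
    0b01: ("H", high, high),   # MPYH / SMPYH
    0b10: ("LH", low, high),   # MPYLH / SMPYLH
    0b11: ("HL", high, low),   # MPYHL / SMPYHL
}


def mpy16(sat, op, src1, src2):
    """Return (mnemonic, 32-bit result pattern) for the SAT and op fields."""
    suffix, pick1, pick2 = HALF_SELECT[op]
    product = pick1(src1) * pick2(src2)
    if sat:
        product <<= 1
        if product == 0x80000000:   # only reached for -32768 * -32768
            product = 0x7FFFFFFF
        return "SMPY" + suffix, product & MASK32
    return "MPY" + suffix, product & MASK32
```

The saturating case arises for exactly one input pair: both halves equal to 0x8000. Every other doubled product fits in a signed 32-bit result.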
Appendix F
This appendix lists the instructions that execute in the .S functional unit and illustrates
the opcode maps for these instructions.
F.3 32-Bit Opcode Maps
Appendix F—.S Unit Instructions and Opcode Maps www.ti.com
x cross path for src2; 0 = do not use cross path, 1 = use cross path
z test for equality with zero or nonzero
3 1 5 5
17 13 12 11 6 5 4 3 2 1 0
src1 x op 1 0 0 0 s p
5 1 6 1 1
3 1 5 5
17 13 12 11 10 9 6 5 4 3 2 1 0
src1 x 1 1 op 1 1 0 0 s p
5 1 4 1 1
0 0 0 z dst src2
1 5 5
17 13 12 11 10 9 6 5 4 3 2 1 0
src1 x 1 1 op 1 1 0 0 s p
5 1 4 1 1
3 1 5 5
17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
op x 1 1 1 1 0 0 1 0 0 0 s p
5 1 1 1
creg z cst21
3 1 21
7 6 5 4 3 2 1 0
cst21 0 0 1 0 0 s p
21 1 1
Figure F-6 Call Unconditional, Immediate with Implied NOP 5 Instruction Format
31 30 29 28 27
0 0 0 z cst21
1 21
7 6 5 4 3 2 1 0
cst21 0 0 1 0 0 s p
21 1 1
creg z src2
3 1 12
15 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 0 0 0 0 1 0 0 1 0 0 0 s p
3 1 1
F.4 16-Bit Opcode Maps
creg z 0 0 0 0 1 src2 0 0
3 1 5
15 13 12 11 10 9 8 7 6 5 4 3 2 1 0
src1 x 0 0 1 1 0 1 1 0 0 0 s p
3 1 1 1
creg z 0 0 0 0 0 src2 0 0
3 1 5
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 x 0 0 1 1 0 1 1 0 0 0 s p
1 1 1
7 6 5 4 3 2 1 0
cst16 h 1 0 1 0 s p
16 1 1 1
17 13 12 8 7 6 5 4 3 2 1 0
csta cstb op 0 0 1 0 s p
5 5 2 1 1
BR Mnemonic
1 BNOP (.unit) scst7, N3
BR Mnemonic
1 BNOP (.unit) ucst8, 5
BR Mnemonic
1 CALLP (.unit) scst10, 5
BR s z Mnemonic
1 0 0 [A0] BNOP .S1 scst7, N3
1 0 1 [!A0] BNOP .S1 scst7, N3
1 1 0 [B0] BNOP .S2 scst7, N3
1 1 1 [!B0] BNOP .S2 scst7, N3
BR s z Mnemonic
1 0 0 [A0] BNOP .S1 ucst8, 5
1 0 1 [!A0] BNOP .S1 ucst8, 5
1 1 0 [B0] BNOP .S2 ucst8, 5
1 1 1 [!B0] BNOP .S2 ucst8, 5
BR SAT op Mnemonic
0 0 0 ADD (.unit) src1, src2, dst
0 1 0 SADD (.unit) src1, src2, dst
0 x 1 SUB (.unit) src1, src2, dst (dst = src1 - src2)
BR op Mnemonic
0 0 SHL (.unit) src2, ucst5, dst
0 1 SHR (.unit) src2, ucst5, dst
Mnemonic
MVK (.unit) ucst8, dst
SAT op Mnemonic
x 0 0 SHL (.unit) src2, ucst5, dst (src2 = dst)
x 0 1 SHR (.unit) src2, ucst5, dst (src2 = dst)
0 1 0 SHRU (.unit) src2, ucst5, dst (src2 = dst)
1 1 0 SSHL (.unit) src2, ucst5, dst (src2 = dst)
x 1 1 see S2sh, Figure F-21
op Mnemonic
0 0 SHL (.unit) src2, src1, dst (src2 = dst, dst = src2 << src1)
0 1 SHR (.unit) src2, src1, dst (src2 = dst, dst = src2 >> src1)
1 0 SHRU (.unit) src2, src1, dst (src2 = dst, dst = src2 >> src1, unsigned shift)
1 1 SSHL (.unit) src2, src1, dst (src2 = dst, dst = src2 sshl src1)
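The four shifts selected by op in the table above differ in direction, in how vacated bits are filled, and in whether overflow saturates. The following is a minimal sketch of the arithmetic on 32-bit values for shift counts of 0 through 31. The function names are hypothetical, and the saturation rule for SSHL (clamp to 0x7FFFFFFF or 0x80000000 when the shifted value no longer fits in a signed 32-bit result) is stated here as an assumption to be checked against the SSHL instruction description.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def shift_by_op(op, src2, count):
    """Return (mnemonic, 32-bit result) for a 2-bit op code; count is 0..31."""
    if not 0 <= count <= 31:
        raise ValueError("this sketch only models shift counts 0..31")
    src2 &= MASK32
    if op == 0b00:                        # SHL: left shift, bits fall off the top
        return "SHL", (src2 << count) & MASK32
    if op == 0b01:                        # SHR: right shift, sign bit replicated
        return "SHR", (to_signed32(src2) >> count) & MASK32
    if op == 0b10:                        # SHRU: right shift, zeros shifted in
        return "SHRU", src2 >> count
    if op == 0b11:                        # SSHL: left shift with saturation
        result = to_signed32(src2) << count
        if result > 0x7FFFFFFF:
            return "SSHL", 0x7FFFFFFF
        if result < -0x80000000:
            return "SSHL", 0x80000000
        return "SSHL", result & MASK32
    raise ValueError("op must be a 2-bit value")
```

Shifting 0x80000000 right by 4 gives 0xF8000000 with SHR and 0x08000000 with SHRU, which is the distinction the table's two right-shift rows encode.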
op Mnemonic
0 0 EXTU (.unit) src2, ucst5,31, A0/B0
0 1 SET (.unit) src2, ucst5, ucst5, dst (src = dst, ucst5 = ucst5)
1 0 CLR (.unit) src2, ucst5, ucst5, dst (src = dst, ucst5 = ucst5)
1 1 see S2ext, Figure F-23
op Mnemonic
0 0 EXT (.unit) src,16, 16, dst
0 1 EXT (.unit) src,24, 24, dst
1 0 EXTU (.unit) src,16, 16, dst
1 1 EXTU (.unit) src,24, 24, dst
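The four rows above are the halfword and byte extension cases of EXT and EXTU. The following is a minimal sketch, assuming the usual definition of these instructions: the source is shifted left by the first constant, then shifted right by the second, with sign fill for EXT and zero fill for EXTU. The names `extract`, `EXT_FORMS`, and `extract_by_op` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def extract(src, csta, cstb, signed):
    """Shift left by csta, then right by cstb (sign fill if signed)."""
    shifted = (src << csta) & MASK32
    if signed:
        return (to_signed32(shifted) >> cstb) & MASK32
    return shifted >> cstb


# op code -> (mnemonic, csta, cstb, signed), following the table above
EXT_FORMS = {
    0b00: ("EXT", 16, 16, True),
    0b01: ("EXT", 24, 24, True),
    0b10: ("EXTU", 16, 16, False),
    0b11: ("EXTU", 24, 24, False),
}


def extract_by_op(op, src):
    """Return (mnemonic, 32-bit result) for one of the four compact forms."""
    name, csta, cstb, signed = EXT_FORMS[op]
    return name, extract(src, csta, cstb, signed)
```

In effect the 16, 16 forms sign-extend or zero-extend the low halfword, and the 24, 24 forms do the same for the low byte.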
op Mnemonic
0 ADD (.unit) src1, src2, dst (src1 = dst)
1 SUB (.unit) src1, src2, dst (src1 = dst, dst = src1 - src2)
Mnemonic
ADDK (.unit) ucst5, dst
op Mnemonic
0 0 0 see LSDx1, Figure G-4
0 0 1 see LSDx1, Figure G-4
0 1 0 SUB (.unit)0, src2, dst (src2 = dst, dst = 0 - src2)
0 1 1 ADD (.unit)-1, src2, dst (src2 = dst)
1 0 0 Reserved
1 0 1 see LSDx1, Figure G-4
1 1 0 MVC (.unit) src, ILC (s = 1)
1 1 1 see LSDx1, Figure G-4
Mnemonic
BNOP (.unit) src2, N3
Appendix G
This appendix illustrates the opcode maps for instructions that can execute in the .D, .L,
or .S functional units.
For a list of the instructions that execute in the .D functional unit, see Appendix C, ".D
Unit Instructions and Opcode Maps," on page C-1. For a list of the instructions that
execute in the .L functional unit, see Appendix D, ".L Unit Instructions and Opcode
Maps," on page D-1. For a list of the instructions that execute in the .S functional unit,
see Appendix F, ".S Unit Instructions and Opcode Maps," on page F-1.
G.2 32-Bit Opcode Maps
Appendix G—.D, .L, or .S Unit Opcode Maps www.ti.com
unit Mnemonic
0 0 MV (.Ln) src, dst
0 1 MV (.Sn) src, dst
1 0 MV (.Dn) src, dst
unit Mnemonic
0 0 MV (.Ln) src, dst
0 1 MV (.Sn) src, dst
1 0 MV (.Dn) src, dst
G.3 16-Bit Opcode Maps
CC Mnemonic
0 0 [A0] MVK (.unit) ucst1, dst
0 1 [!A0] MVK (.unit) ucst1, dst
1 0 [B0] MVK (.unit) ucst1, dst
1 1 [!B0] MVK (.unit) ucst1, dst
CC unit Mnemonic
0 0 0 0 [A0] MVK (.Ln) ucst1, dst
0 1 [A0] MVK (.Sn) ucst1, dst
1 0 [A0] MVK (.Dn) ucst1, dst
CC unit Mnemonic
0 1 0 0 [!A0] MVK (.Ln) ucst1, dst
0 1 [!A0] MVK (.Sn) ucst1, dst
1 0 [!A0] MVK (.Dn) ucst1, dst
CC unit Mnemonic
1 0 0 0 [B0] MVK (.Ln) ucst1, dst
0 1 [B0] MVK (.Sn) ucst1, dst
1 0 [B0] MVK (.Dn) ucst1, dst
CC unit Mnemonic
1 1 0 0 [!B0] MVK (.Ln) ucst1, dst
0 1 [!B0] MVK (.Sn) ucst1, dst
1 0 [!B0] MVK (.Dn) ucst1, dst
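The CC and unit tables above combine two independent 2-bit fields: CC selects the predicate ([A0], [!A0], [B0], or [!B0]) and unit selects .L, .S, or .D. The following is a minimal sketch of that lookup. The function name is hypothetical, and treating unit value 1 1 as invalid is an assumption based only on its absence from the tables.

```python
# CC field -> predicate text, as listed in the tables above
CC_PREDICATE = {0b00: "[A0]", 0b01: "[!A0]", 0b10: "[B0]", 0b11: "[!B0]"}

# unit field -> functional unit; 0b11 does not appear in the tables
UNIT_NAME = {0b00: ".Ln", 0b01: ".Sn", 0b10: ".Dn"}


def describe_conditional_mvk(cc, unit):
    """Return the mnemonic line for a CC and unit field combination."""
    if cc not in CC_PREDICATE:
        raise ValueError("CC must be a 2-bit value")
    if unit not in UNIT_NAME:
        raise ValueError("unit value is not listed in the tables")
    return "%s MVK (%s) ucst1, dst" % (CC_PREDICATE[cc], UNIT_NAME[unit])
```

Keeping the two fields in separate lookup tables mirrors the way the 12 rows above are simply the cross product of four predicates and three units.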
op Mnemonic
0 0 0 MVK (.unit)0, dst
0 0 1 MVK (.unit)1, dst
0 1 0 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-26
0 1 1 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-26
1 0 0 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-26
1 0 1 ADD (.unit) src, 1, dst (src = dst)
1 1 0 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-26
1 1 1 XOR (.unit) src, 1, dst (src = dst)
op unit Mnemonic
0 0 0 0 0 MVK (.Ln)0, dst
0 1 MVK (.Sn)0, dst
1 0 MVK (.Dn)0, dst
op unit Mnemonic
0 0 1 0 0 MVK (.Ln)1, dst
0 1 MVK (.Sn)1, dst
1 0 MVK (.Dn)1, dst
op unit Mnemonic
1 0 1 0 0 ADD (.Ln) src, 1, dst
0 1 ADD (.Sn) src, 1, dst
1 0 ADD (.Dn) src, 1, dst
op unit Mnemonic
1 1 1 0 0 XOR (.Ln) src, 1, dst
0 1 XOR (.Sn) src, 1, dst
1 0 XOR (.Dn) src, 1, dst
Appendix H
This appendix lists the instructions that execute with no unit specified and illustrates
the opcode maps for these instructions.
For a list of the instructions that execute in the .D functional unit, see Appendix C, ".D
Unit Instructions and Opcode Maps," on page C-1. For a list of the instructions that
execute in the .L functional unit, see Appendix D, ".L Unit Instructions and Opcode
Maps," on page D-1. For a list of the instructions that execute in the .M functional unit,
see Appendix E, ".M Unit Instructions and Opcode Maps," on page E-1. For a list of the
instructions that execute in the .S functional unit, see Appendix F, ".S Unit Instructions
and Opcode Maps," on page F-1.
H.1 Instructions Executing With No Unit Specified
Appendix H—No Unit Specified Instructions and Opcode Maps www.ti.com
N3 3-bit field
op opfield; field within opcode that specifies a unique instruction
p parallel execution; 0 = next instruction is not executed in parallel, 1 = next instruction is
executed in parallel
s side A or B for destination; 0 = side A, 1 = side B.
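The p bit defined above chains instructions together: a 1 means the next instruction executes in parallel with the current one, and a 0 ends the group. The following is a minimal sketch that splits a list of 32-bit opcode words into such groups using only bit 0. The function name is hypothetical, and the sketch ignores 16-bit compact instructions and fetch packet headers.

```python
def group_by_p_bit(words):
    """Group 32-bit opcode words into lists that execute in parallel.

    Bit 0 of each word is the p bit: 1 means the next word joins the
    same group, 0 means the current word ends its group.
    """
    groups = []
    current = []
    for word in words:
        current.append(word)
        if (word & 1) == 0:    # p = 0: next instruction is not parallel
            groups.append(current)
            current = []
    if current:                # trailing words whose chain was never closed
        groups.append(current)
    return groups
```

For example, words whose p bits read 1, 1, 0, 0, 1, 0 form three groups of three, one, and two instructions.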
H.3 32-Bit Opcode Maps
3 1 5 5
16 13 12 11 10 9 8 7 6 5 4 3 2 1 0
op 0 0 0 0 0 0 0 0 0 0 0 s p
4 1 1
0 0 0 1 Reserved (0) 0
10
16 13 12 11 10 9 8 7 6 5 4 3 2 1 0
op 0 0 0 0 0 0 0 0 0 0 0 s p
4 1 1
op Mnemonic
0 SPLOOP ii (ii = real ii - 1)
1 SPLOOPD ii
H.4 16-Bit Opcode Maps
op Mnemonic
0 [A0] SPLOOPD ii (ii = real ii - 1)
1 [B0] SPLOOPD ii
Mnemonic
SPKERNEL ii/stage
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
D2 D1 1 0 1 1 S2 S1 L2 1 1 0 0 1 1 L1
1 1 1 1 1 1
NOTE: Supports masking of D1, D2, L1, L2, S1, and S2 instructions (not M1 or M2)
Mnemonic
SPMASK unitmask
b) SPMASKR Instruction
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
D2 D1 1 1 1 1 S2 S1 L2 1 1 0 0 1 1 L1
1 1 1 1 1 1
NOTE: Supports masking of D1, D2, L1, L2, S1, and S2 instructions (not M1 or M2)
Mnemonic
SPMASKR unitmask
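The two 16-bit layouts above differ only in bit 12 (0 for SPMASK, 1 for SPMASKR). The unit flags sit at bits 15 (D2), 14 (D1), 9 (S2), 8 (S1), 7 (L2), and 0 (L1), and the remaining bits are fixed. The following is a minimal sketch that builds and checks such a halfword from the bit rows shown. The function and constant names are hypothetical, and the sketch assumes the unit flag is 1 when the unit is masked.

```python
# Bit position of each unit flag, read from the bit-number rows above
UNIT_BITS = {"D2": 15, "D1": 14, "S2": 9, "S1": 8, "L2": 7, "L1": 0}

# Fixed bits: 13-10 = 1 0 1 1 (SPMASK) and 6-1 = 1 1 0 0 1 1
SPMASK_BASE = (0b1011 << 10) | (0b110011 << 1)   # 0x2C66
RESTORE_BIT = 1 << 12                            # set for SPMASKR

UNIT_MASK = sum(1 << bit for bit in UNIT_BITS.values())
FIXED_MASK = 0xFFFF & ~UNIT_MASK & ~RESTORE_BIT


def encode_spmask(units, restore=False):
    """Build the 16-bit halfword for SPMASK (or SPMASKR if restore)."""
    halfword = SPMASK_BASE | (RESTORE_BIT if restore else 0)
    for unit in units:
        halfword |= 1 << UNIT_BITS[unit]   # KeyError for M1/M2: not maskable
    return halfword


def decode_spmask(halfword):
    """Return (mnemonic, set of unit names) or raise if fixed bits differ."""
    if (halfword & FIXED_MASK) != SPMASK_BASE:
        raise ValueError("fixed bits do not match the SPMASK/SPMASKR layout")
    name = "SPMASKR" if halfword & RESTORE_BIT else "SPMASK"
    units = {u for u, bit in UNIT_BITS.items() if halfword & (1 << bit)}
    return name, units
```

Because M1 and M2 have no flag in either layout, the lookup fails for them, which matches the note that those units cannot be masked.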
Mnemonic
NOP N3
IMPORTANT NOTICE
Texas Instruments Incorporated and its subsidiaries (TI) reserve the right to make corrections, modifications, enhancements, improvements,
and other changes to its products and services at any time and to discontinue any product or service without notice. Customers should
obtain the latest relevant information before placing orders and should verify that such information is current and complete. All products are
sold subject to TI’s terms and conditions of sale supplied at the time of order acknowledgment.
TI warrants performance of its hardware products to the specifications applicable at the time of sale in accordance with TI’s standard
warranty. Testing and other quality control techniques are used to the extent TI deems necessary to support this warranty. Except where
mandated by government requirements, testing of all parameters of each product is not necessarily performed.
TI assumes no liability for applications assistance or customer product design. Customers are responsible for their products and
applications using TI components. To minimize the risks associated with customer products and applications, customers should provide
adequate design and operating safeguards.
TI does not warrant or represent that any license, either express or implied, is granted under any TI patent right, copyright, mask work right,
or other TI intellectual property right relating to any combination, machine, or process in which TI products or services are used. Information
published by TI regarding third-party products or services does not constitute a license from TI to use such products or services or a
warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual
property of the third party, or a license from TI under the patents or other intellectual property of TI.
Reproduction of TI information in TI data books or data sheets is permissible only if reproduction is without alteration and is accompanied
by all associated warranties, conditions, limitations, and notices. Reproduction of this information with alteration is an unfair and deceptive
business practice. TI is not responsible or liable for such altered documentation. Information of third parties may be subject to additional
restrictions.
Resale of TI products or services with statements different from or beyond the parameters stated by TI for that product or service voids all
express and any implied warranties for the associated TI product or service and is an unfair and deceptive business practice. TI is not
responsible or liable for any such statements.
TI products are not authorized for use in safety-critical applications (such as life support) where a failure of the TI product would reasonably
be expected to cause severe personal injury or death, unless officers of the parties have executed an agreement specifically governing
such use. Buyers represent that they have all necessary expertise in the safety and regulatory ramifications of their applications, and
acknowledge and agree that they are solely responsible for all legal, regulatory and safety-related requirements concerning their products
and any use of TI products in such safety-critical applications, notwithstanding any applications-related information or support that may be
provided by TI. Further, Buyers must fully indemnify TI and its representatives against any damages arising out of the use of TI products in
such safety-critical applications.
TI products are neither designed nor intended for use in military/aerospace applications or environments unless the TI products are
specifically designated by TI as military-grade or "enhanced plastic." Only products designated by TI as military-grade meet military
specifications. Buyers acknowledge and agree that any such use of TI products which TI has not designated as military-grade is solely at
the Buyer's risk, and that they are solely responsible for compliance with all legal and regulatory requirements in connection with such use.
TI products are neither designed nor intended for use in automotive applications or environments unless the specific TI products are
designated by TI as compliant with ISO/TS 16949 requirements. Buyers acknowledge and agree that, if they use any non-designated
products in automotive applications, TI will not be responsible for any failure to meet such requirements.
Following are URLs where you can obtain information on other Texas Instruments products and application solutions:
Products
Amplifiers: amplifier.ti.com
Data Converters: dataconverter.ti.com
DLP® Products: www.dlp.com
DSP: dsp.ti.com
Clocks and Timers: www.ti.com/clocks
Interface: interface.ti.com
Logic: logic.ti.com
Power Mgmt: power.ti.com
Microcontrollers: microcontroller.ti.com
RFID: www.ti-rfid.com
RF/IF and ZigBee® Solutions: www.ti.com/lprf
Applications
Audio: www.ti.com/audio
Automotive: www.ti.com/automotive
Communications and Telecom: www.ti.com/communications
Computers and Peripherals: www.ti.com/computers
Consumer Electronics: www.ti.com/consumer-apps
Energy: www.ti.com/energy
Industrial: www.ti.com/industrial
Medical: www.ti.com/medical
Security: www.ti.com/security
Space, Avionics & Defense: www.ti.com/space-avionics-defense
Video and Imaging: www.ti.com/video
Wireless: www.ti.com/wireless-apps
Mailing Address: Texas Instruments, Post Office Box 655303, Dallas, Texas 75265
Copyright © 2010, Texas Instruments Incorporated