Sprueb 8 B
Notational Conventions
This document uses the following conventions:
- Hexadecimal numbers are shown with the suffix h. For example, the
following number is 40 hexadecimal (decimal 64): 40h.
Trademarks
C6000, TMS320C64x+, TMS320C64x, C64x are trademarks of Texas
Instruments.
1 Introduction
Provides a brief introduction to the TI C64x+ DSP Library (DSPLIB), shows the organization
of the routines contained in the library, and lists the features and benefits of the DSPLIB.
1.1 Introduction to the TI C64x+ DSPLIB
1.2 Features and Benefits
1.3 Optimization Techniques
Contents
  DSP_fft16x16r
  DSP_fft16x32
  DSP_fft32x32
  DSP_fft32x32s
  DSP_ifft16x16
  DSP_ifft16x16_imre
  DSP_ifft16x32
  DSP_ifft32x32
4.4 Filtering and Convolution
  DSP_fir_cplx
  DSP_fir_cplx_hM4X4
  DSP_fir_gen
  DSP_fir_gen_hM17_rA8X8
  DSP_fir_r4
  DSP_fir_r8
  DSP_fir_r8_hM16_rM8A8X8
  DSP_fir_sym
  DSP_iir
  DSP_iirlat
4.5 Math
  DSP_dotp_sqr
  DSP_dotprod
  DSP_maxval
  DSP_maxidx
  DSP_minval
  DSP_mul32
  DSP_neg32
  DSP_recip16
  DSP_vecsumsq
  DSP_w_vec
4.6 Matrix
  DSP_mat_mul
  DSP_mat_trans
4.7 Miscellaneous
  DSP_bexp
  DSP_blk_eswap16
  DSP_blk_eswap32
  DSP_blk_eswap64
  DSP_blk_move
  DSP_fltoq15
  DSP_minerror
  DSP_q15tofl
A Performance/Fractional Q Formats
Describes performance considerations related to the C64x+ DSPLIB and provides information
about the Q format used by DSPLIB functions.
A.1 Performance Considerations
C Glossary
Defines terms and abbreviations used in this book.
Index
Tables
2-1 DSPLIB Data Types
3-1 Argument Conventions
3-2 Adaptive Filtering
3-3 Correlation
3-4 FFT
3-5 Filtering and Convolution
3-6 Math
3-7 Matrix
3-8 Miscellaneous
A-1 Q3.12 Bit Fields
A-2 Q.15 Bit Fields
A-3 Q.31 Low Memory Location Bit Fields
A-4 Q.31 High Memory Location Bit Fields
Chapter 1
This chapter provides a brief introduction to the TI C64x+ DSP Library (DSPLIB), shows the
organization of the routines contained in the library, and lists the features and benefits of the
DSPLIB.
Introduction to the TI C64x+ DSPLIB
- Adaptive filtering
  - DSP_firlms2
- Correlation
  - DSP_autocor
- FFT
  - DSP_fft16x16
  - DSP_fft16x16_imre
  - DSP_fft16x16r
  - DSP_fft16x32
  - DSP_fft32x32
  - DSP_fft32x32s
  - DSP_ifft16x16
  - DSP_ifft16x16_imre
  - DSP_ifft16x32
  - DSP_ifft32x32
- Math
  - DSP_dotp_sqr
  - DSP_dotprod
  - DSP_maxval
  - DSP_maxidx
  - DSP_minval
  - DSP_mul32
  - DSP_neg32
  - DSP_recip16
  - DSP_vecsumsq
  - DSP_w_vec
- Matrix
  - DSP_mat_mul
  - DSP_mat_trans
- Miscellaneous
  - DSP_bexp
  - DSP_blk_eswap16
  - DSP_blk_eswap32
  - DSP_blk_eswap64
  - DSP_blk_move
  - DSP_fltoq15
  - DSP_minerror
  - DSP_q15tofl
Features and Benefits
Optimization Techniques
Chapter 2
This chapter provides information on how to install and rebuild the TI C64x+ DSPLIB.
How to Install DSPLIB
Note:
You should read the README.txt file for specific details of the release.
c64plus
|
+--dsplib
   |
   +--docs                Library documentation
   |
   +--example             Example to show DSPLIB usage
   |
   +--src                 Source code with CCS project examples
   |  |
   |  +--[Kernels]
   |
   |--dsplib64plus.h      Header file containing kernel definitions
   |
   |--dsplib64plus.lib    Precompiled library
   |
   |--dsplib64plus.pjt    Provided project to rebuild library
   |
   |--README.txt          Top-level README file
After completing the installation of DSPLIB, follow the instructions in sections 2.2.2 and 2.2.3 on
how to call the functions from your source code.
Using DSPLIB
Name     Size (bits)   Type      Minimum   Maximum
short    16            integer   -32768    32767
Unless specifically noted, DSPLIB operates on Q.15-fractional data type elements. Appendix A
presents an overview of Fractional Q formats.
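As a rough illustration of what Q.15 data looks like in practice, the sketch below converts between floating-point values and 16-bit Q.15 and multiplies two Q.15 values. The helper names (q15_from_float, q15_to_float, q15_mul) are hypothetical and are not DSPLIB functions; the saturation behavior is an assumption chosen for the example, and the multiply assumes an arithmetic right shift for negative products.

```c
#include <stdint.h>

/* Hypothetical helper: convert a float in [-1.0, 1.0) to Q.15.
 * Values outside the representable range are saturated (an assumption
 * made for this sketch). */
int16_t q15_from_float(float v)
{
    float scaled = v * 32768.0f;          /* 1.0 corresponds to 2^15 */
    if (scaled >= 32767.0f)  return 32767;
    if (scaled <= -32768.0f) return -32768;
    return (int16_t)scaled;               /* truncates toward zero */
}

/* Hypothetical helper: convert Q.15 back to float. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Hypothetical helper: Q.15 x Q.15 -> Q.15 multiply.
 * The 32-bit product is in Q.30, so shifting right by 15 returns to Q.15. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > 32767) p = 32767;             /* only reachable for -1.0 * -1.0 */
    return (int16_t)p;
}
```

For instance, 0.5 maps to 16384, and multiplying 16384 by 16384 yields 8192, which is 0.25 in Q.15.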
The source code and CCS project in the “c64plus\dsplib\example\” directory show how to use
DSPLIB in a Code Composer Studio C environment.
How to Rebuild DSPLIB
2. Open the ’dsplib64plus.pjt’ project by clicking Project −> Open in the menu
and browsing to the ’c64plus\dsplib\’ directory.
This chapter provides tables containing all DSPLIB functions and a brief description of each.
Arguments and Conventions Used
Argument      Description
x, y          Arguments reflecting input data vectors
nx, ny, nr    Arguments reflecting the sizes of vectors x, y, and r, respectively. For
              functions where nx = ny = nr, only nx is used.
Some C64x+ functions have additional restrictions due to optimization using new features such as
higher multiply throughput. While these new functions perform better, they can also lead to
problems if not used carefully. Therefore, the names of the new functions encode any additional
restrictions. Three types of restrictions can be specified for a pointer: minimum buffer size (M),
buffer alignment (A), and the number of elements in the buffer being a multiple of an integer (X).
The following convention has been used when describing the arguments for each individual function:
A kernel function foo with two parameters, m and n, with the following restrictions:
m −> Minimum buffer size = 8, buffer alignment = double word, buffer
needs to be a multiple of 8 elements
n −> Minimum buffer size = 32, buffer alignment = word , buffer needs to be
a multiple of 16 elements
This function would be named: foo_mM8A8X8_nM32A4X16.
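The M/A/X restrictions encoded in a function name can be checked at run time before calling a kernel. The following is a minimal sketch, assuming the A value is expressed in bytes (double word = 8, word = 4, as in the foo example above); buf_meets_restrictions is a hypothetical helper and not part of DSPLIB.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if a buffer satisfies an
 * M<min_elems>A<align_bytes>X<multiple> restriction, 0 otherwise.
 *   min_elems   - minimum buffer size in elements (M)
 *   align_bytes - required alignment in bytes (A), assumed nonzero
 *   multiple    - element count must be a multiple of this value (X),
 *                 assumed nonzero */
int buf_meets_restrictions(const void *p, size_t n_elems,
                           size_t min_elems, size_t align_bytes,
                           size_t multiple)
{
    if (n_elems < min_elems)
        return 0;                                  /* violates M */
    if (((uintptr_t)p % align_bytes) != 0)
        return 0;                                  /* violates A */
    if ((n_elems % multiple) != 0)
        return 0;                                  /* violates X */
    return 1;
}
```

For the example function foo_mM8A8X8_nM32A4X16, a caller could check m with (8, 8, 8) and n with (32, 4, 16) before invoking the kernel.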
DSPLIB Functions
- Adaptive filtering
- Correlation
- FFT
- Filtering and convolution
- Math
- Matrix functions
- Miscellaneous
Function     void DSP_fft16x16_imre(short *w, int nx, short *x, short *y)
Description  Complex out of place, Forward FFT mixed radix with digit reversal.
             Input/Output data in Im/Re order.

Function     void DSP_fft16x16r(int nx, short *x, short *w, unsigned char *brev, short *y,
             int offset, int n_max)
Description  Mixed radix FFT with scaling and rounding, digit reversal, out of place.
             Input and output: 16 bits, Twiddle factor: 16 bits.

Function     void DSP_fft16x32(short *w, int nx, int *x, int *y)
Description  Extended precision, mixed radix FFT, rounding, digit reversal, out of place.
             Input and output: 32 bits, Twiddle factor: 16 bits.

Function     void DSP_fft32x32(int *w, int nx, int *x, int *y)
Description  Extended precision, mixed radix FFT, rounding, digit reversal, out of place.
             Input and output: 32 bits, Twiddle factor: 32 bits.

Function     void DSP_fft32x32s(int *w, int nx, int *x, int *y)
Description  Extended precision, mixed radix FFT, digit reversal, out of place, with
             scaling and rounding. Input and output: 32 bits, Twiddle factor: 32 bits.
DSPLIB Function Tables
Function     void DSP_ifft16x16_imre(short *w, int nx, short *x, short *y)
Description  Complex out of place, Inverse FFT mixed radix with digit reversal.
             Input/Output data in Im/Re order.

Function     void DSP_ifft16x32(short *w, int nx, int *x, int *y)
Description  Extended precision, mixed radix IFFT, rounding, digit reversal, out of place.
             Input and output: 32 bits, Twiddle factor: 16 bits.

Function     void DSP_ifft32x32(int *w, int nx, int *x, int *y)
Description  Extended precision, mixed radix IFFT, digit reversal, out of place, with
             scaling and rounding. Input and output: 32 bits, Twiddle factor: 32 bits.
Function     void DSP_fir_cplx_hM4X4(short *x, short *h, short *r, int nh, int nr)
Description  Complex FIR Filter (nh is a multiple of 4)

Function     void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)
Description  FIR Filter (any nh)

Function     void DSP_fir_gen_hM17_rA8X8(short *x, short *h, short *r, int nh, int nr)
Description  FIR Filter (r[] must be double word aligned, nr must be a multiple of 8)

Function     void DSP_fir_r4(short *x, short *h, short *r, int nh, int nr)
Description  FIR Filter (nh is a multiple of 4)

Function     void DSP_fir_r8(short *x, short *h, short *r, int nh, int nr)
Description  FIR Filter (nh is a multiple of 8)

Function     void DSP_fir_r8_hM16_rM8A8X8(short *x, short *h, short *r, int nh, int nr)
Description  FIR Filter (r[] must be double word aligned, nr is a multiple of 8)

Function     void DSP_fir_sym(short *x, short *h, short *r, int nh, int nr, int s)
Description  Symmetric FIR Filter (nh is a multiple of 8)

Function     void DSP_iirlat(short *x, int nx, short *k, int nk, int *b, short *r)
Description  All-pole IIR Lattice Filter
Function     int DSP_dotprod(short *x, short *y, int nx)
Description  Vector Dot Product

Function     short DSP_maxval(short *x, int nx)
Description  Maximum Value of a Vector

Function     int DSP_maxidx(short *x, int nx)
Description  Index of the Maximum Element of a Vector

Function     short DSP_minval(short *x, int nx)
Description  Minimum Value of a Vector

Function     void DSP_mul32(int *x, int *y, int *r, short nx)
Description  32-bit Vector Multiply

Function     void DSP_neg32(int *x, int *r, short nx)
Description  32-bit Vector Negate

Function     void DSP_recip16(short *x, short *rfrac, short *rexp, short nx)
Description  16-bit Reciprocal

Function     void DSP_w_vec(short *x, short *y, short m, short *r, short nr)
Description  Weighted Vector Sum

Function     void DSP_mat_trans(short *x, short rows, short columns, short *r)
Description  Matrix Transpose
Function     void DSP_blk_eswap16(void *x, void *r, int nx)
Description  Endian-swap a block of 16-bit values

Function     void DSP_blk_eswap32(void *x, void *r, int nx)
Description  Endian-swap a block of 32-bit values

Function     void DSP_blk_eswap64(void *x, void *r, int nx)
Description  Endian-swap a block of 64-bit values

Function     void DSP_blk_move(short *x, short *r, int nx)
Description  Move a Block of Memory

Function     void DSP_fltoq15(float *x, short *r, short nx)
Description  Float to Q15 Conversion

Function     int DSP_minerror(short *GSP0_TABLE, short *errCoefs, int *savePtr_ret)
Description  Minimum Energy Error Search

Function     void DSP_q15tofl(short *x, float *r, short nx)
Description  Q15 to Float Conversion
This chapter provides a list of the functions within the DSP library (DSPLIB) organized into
functional categories. The functions within each category are listed in alphabetical order and
include arguments, descriptions, algorithms, benchmarks, and special requirements.
DSP_firlms2
Description The Least Mean Square Adaptive Filter computes an update of all nh
coefficients by adding the weighted error times the inputs to the original
coefficients. The input array includes the last nh inputs followed by a new
single sample input. The coefficient array includes nh coefficients.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
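long DSP_firlms2(short h[ ], short x[ ], short b, int nh)
{
    int i;
    long r = 0;

    for (i = 0; i < nh; i++) {
        /* Update coefficient i with the weighted error times the input. */
        h[i] += (x[i] * b) >> 15;
        /* Accumulate the output using the next input sample. */
        r += x[i + 1] * h[i];
    }
    return r;
}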
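To make the data layout concrete, the sketch below restates the natural C loop in portable form and is followed by a small worked example. The name firlms2_ref is hypothetical, and the 64-bit accumulator is an assumption made so the example behaves the same on any host; it is not a statement about the library's return type.

```c
/* Hypothetical portable restatement of the natural C loop above.
 * h : nh coefficients, updated in place
 * x : nh + 1 input samples (the last nh inputs followed by the new sample)
 * b : weighted error term in Q.15
 * A long long accumulator is used here so the sketch does not depend on
 * the width of long on the host. */
long long firlms2_ref(short h[], const short x[], short b, int nh)
{
    long long r = 0;
    int i;

    for (i = 0; i < nh; i++) {
        h[i] += (x[i] * b) >> 15;            /* coefficient update */
        r += (long long)x[i + 1] * h[i];     /* filter output accumulation */
    }
    return r;
}
```

With nh = 2, h = {0, 0}, x = {16384, 16384, 16384}, and b = 16384 (0.5 in Q.15), each coefficient becomes 8192 and the returned sum is 2 * 16384 * 8192 = 268435456.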
Special Requirements
Implementation Notes
- Interruptibility: The code is interruptible.
- The loop is unrolled 4 times.
4.2 Correlation
DSP_autocor AutoCorrelation
Function void DSP_autocor(short * restrict r, short * restrict x, int nx, int nr)
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_autocor(short r[ ], short x[ ], int nx, int nr)
{
    int i, k, sum;

    for (i = 0; i < nr; i++) {
        sum = 0;
        /* Correlate x with a copy of itself delayed by i samples. */
        for (k = nr; k < nx + nr; k++)
            sum += x[k] * x[k - i];
        r[i] = (sum >> 15);
    }
}
Special Requirements
- nx must be a multiple of 8.
- nr must be a multiple of 4.
- x[ ] must be double-word aligned.
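The loop bounds in the natural C code imply that x[ ] is read from index 1 up to index nx + nr - 1, so the caller needs an input buffer of at least nx + nr elements. The sketch below restates the loop under a hypothetical name (autocor_ref) so that this sizing can be tried on a host compiler; it does not enforce the alignment requirement listed above.

```c
/* Hypothetical host-side restatement of the natural C loop.
 * r : nr output lags
 * x : nx + nr input samples; lag i reads x[k - i] for k = nr .. nx + nr - 1 */
void autocor_ref(short r[], const short x[], int nx, int nr)
{
    int i, k, sum;

    for (i = 0; i < nr; i++) {
        sum = 0;
        for (k = nr; k < nx + nr; k++)
            sum += x[k] * x[k - i];
        r[i] = (short)(sum >> 15);   /* scale the Q.30 sum back to Q.15 */
    }
}
```

With nx = 8, nr = 4, and all 12 input samples equal to 8192 (0.25 in Q.15), every lag sums 8 products of 8192 * 8192, so each r[i] is 16384.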
Implementation Notes
4.3 FFT
Function void DSP_fft16x16(short * restrict w, int nx, short * restrict x, short * restrict y)
Description This routine computes a complex forward mixed radix FFT with rounding and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved real and imaginary parts. The code uses a
special ordering of FFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
Algorithm All stages are radix-4 except the last one, which can be radix-2 or radix-4,
depending on the size of the FFT. All stages except the last one scale the stage
output data by two.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices. All data are in short precision
or Q.15 format.
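The interleaved layout described above can be sketched with a few accessors. The function names below are hypothetical and only illustrate the indexing; reim_to_imre shows how the same data would look with the imaginary part first, which is the order used by the _imre variants of these routines.

```c
#include <stdint.h>

/* Hypothetical accessors for an interleaved Re/Im array:
 * complex sample n has its real part at index 2n and its
 * imaginary part at index 2n + 1. */
int16_t cplx_re(const int16_t *x, int n) { return x[2 * n]; }
int16_t cplx_im(const int16_t *x, int n) { return x[2 * n + 1]; }

/* Hypothetical helper: copy n complex samples from Re/Im order
 * into Im/Re order (imaginary at even indices, real at odd). */
void reim_to_imre(const int16_t *src, int16_t *dst, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        dst[2 * i]     = src[2 * i + 1];   /* imaginary part first */
        dst[2 * i + 1] = src[2 * i];       /* then real part */
    }
}
```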
DSP_fft16x16
Implementation Notes
The routine uses log4(nx) − 1 stages of radix-4 transform and performs either a radix-2 or radix-4
transform on the last stage depending on nx. If nx is a power of 4, then this last stage is also a
radix-4 transform, otherwise it is a radix-2 transform. The conventional Cooley Tukey FFT is
written using three loops. The outermost loop “k” cycles through the stages. There are log N to
the base 4 stages in all. The loop “j” cycles through the groups of butterflies with different twiddle
factors, and loop “i” reuses the twiddle factors for the different butterflies within a stage. Note the
following:
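Stage      Groups (loop “j”)   Butterflies per group (loop “i”)   Groups x Butterflies
1          N/4                 1                                  N/4
2          N/16                4                                  N/4
..         ..                  ..                                 ..
log4(N)    1                   N/4                                N/4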
1) Inner loop “i0” iterates a variable number of times. In particular, the number
of iterations quadruples every time from 1..N/4. Hence, software pipelining
a loop that iterates a variable number of times is not profitable.
2) Outer loop “j” iterates a variable number of times as well. However, the
number of iterations is quartered every time from N/4 ..1. Hence, the
behavior in 1) and 2) is exactly opposite.
3) If the two loops “i” and “j” are coalesced together then they will iterate for
a fixed number of times, namely N/4. This allows us to combine the “i” and
“j” loops into one loop. Optimized implementations will make use of this
fact.
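The observation in item 3 can be checked numerically. The sketch below assumes N is a power of 4 and uses zero-based stage numbers; the function names are hypothetical. At every stage the product of the two loop counts is N/4, which is what makes the coalesced loop a fixed-trip-count loop.

```c
/* Hypothetical helpers for a radix-4 FFT of n points (n a power of 4).
 * stage is zero-based: 0 is the first stage. */

/* Iterations of the outer "j" loop: N/4, N/16, ..., 1. */
int groups_at_stage(int n, int stage)
{
    return n >> (2 * (stage + 1));
}

/* Iterations of the inner loop: 1, 4, ..., N/4. */
int butterflies_per_group(int stage)
{
    return 1 << (2 * stage);
}

/* Trip count of the coalesced loop; evaluates to n / 4 for every stage. */
int coalesced_trip_count(int n, int stage)
{
    return groups_at_stage(n, stage) * butterflies_per_group(stage);
}
```

For N = 64 the three stages give 16 x 1, 4 x 4, and 1 x 16 iterations, each equal to N/4 = 16.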
In addition, the Cooley Tukey FFT accesses three twiddle factors per iteration
of the inner loop, as the butterflies that reuse twiddle factors are lumped
together. This leads to accessing the twiddle factor array at three points, each
separated by “ie”. Note that “ie” is initially 1, and is quadrupled with every
iteration. Therefore, these three twiddle factors are not even contiguous in the
array.
To vectorize the FFT, it is desirable to access the twiddle factor array using
double word wide loads and fetch the twiddle factors needed. To do this, a
modified twiddle factor array is created, in which the factors WN/4, WN/2,
W3N/4 are arranged to be contiguous. This eliminates the separation between
twiddle factors within a butterfly. However, this implies that we maintain a
redundant version of the twiddle factor array as the loop is traversed from one
stage to another. Hence, the size of the twiddle factor array increases as
compared to the normal Cooley Tukey FFT. The modified twiddle factor array
is of size “2 * N”, whereas that of the conventional Cooley Tukey FFT is of size “3N/4”,
where N is the number of complex points to be transformed. The routine that
generates the modified twiddle factor array was presented earlier. With the
above transformation of the FFT, both the input data and the twiddle factor
array can be accessed using double-word wide loads to enable packed data
processing.
The final stage is optimized to remove the multiplication as w0 = 1. This stage
also performs digit reversal on the data, so the final output is in natural order.
In addition, if the number of points to be transformed is a power of 2 but not a
power of 4, the final stage applies a radix-2 pass instead of a radix-4 pass. In
any case, the outputs are returned in normal order.
The code performs the bulk of the computation in place. However, because
digit-reversal cannot be performed in-place, the final result is written to a
separate array, y[].
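The following sketch illustrates why digit reversal is written to a separate array. It covers only the pure radix-4 case (n a power of 4) and uses hypothetical function names; the library's mixed-radix handling for sizes that are not a power of 4 is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the base-4 digits of idx, where n is the
 * transform size and is assumed to be a power of 4. */
unsigned digit_reverse_radix4(unsigned idx, unsigned n)
{
    unsigned rev = 0;
    for (; n > 1; n >>= 2) {
        rev = (rev << 2) | (idx & 3u);   /* move the lowest digit to the top */
        idx >>= 2;
    }
    return rev;
}

/* Hypothetical helper: copy n interleaved complex samples from x to y in
 * digit-reversed order. Writing to a separate array avoids overwriting
 * samples that have not been moved yet. */
void digit_reverse_copy(const int16_t *x, int16_t *y, unsigned n)
{
    unsigned i;
    for (i = 0; i < n; i++) {
        unsigned j = digit_reverse_radix4(i, n);
        y[2 * j]     = x[2 * i];
        y[2 * j + 1] = x[2 * i + 1];
    }
}
```

For n = 16, index 1 (base-4 digits 0,1) maps to 4 (digits 1,0), and index 6 (digits 1,2) maps to 9 (digits 2,1).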
DSP_fft16x16_imre Complex Forward Mixed Radix 16 x 16-bit FFT, With Im/Re Order
Function void DSP_fft16x16_imre(short * restrict w, int nx, short * restrict x, short * restrict y)
Description This routine computes a complex forward mixed radix FFT with rounding and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved imaginary and real parts. The code uses a
special ordering of FFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
Algorithm All stages are radix-4 except the last one, which can be radix-2 or radix-4,
depending on the size of the FFT. All stages except the last one scale the stage
output data by two.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the imaginary/real
components stored in adjacent locations in the array. The imaginary
components are stored at even array indices, and the real components are
stored at odd array indices. All data are in short precision or Q.15 format.
Implementation Notes
- Interruptibility: The code is interruptible.
The routine uses log4(nx) − 1 stages of radix-4 transform and performs either
a radix-2 or radix-4 transform on the last stage depending on nx. If nx is a
power of 4, then this last stage is also a radix-4 transform, otherwise it is a
radix-2 transform. The conventional Cooley Tukey FFT is written using three
loops. The outermost loop “k” cycles through the stages. There are log N to
the base 4 stages in all. The loop “j” cycles through the groups of butterflies
with different twiddle factors, and loop “i” reuses the twiddle factors for the
different butterflies within a stage. Note the following:
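Stage      Groups (loop “j”)   Butterflies per group (loop “i”)   Groups x Butterflies
1          N/4                 1                                  N/4
2          N/16                4                                  N/4
..         ..                  ..                                 ..
log4(N)    1                   N/4                                N/4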
1) Inner loop “i0” iterates a variable number of times. In particular, the number
of iterations quadruples every time from 1..N/4. Hence, software pipelining
a loop that iterates a variable number of times is not profitable.
2) Outer loop “j” iterates a variable number of times as well. However, the
number of iterations is quartered every time from N/4 ..1. Hence, the
behavior in 1) and 2) is exactly opposite.
3) If the two loops “i” and “j” are coalesced together then they will iterate for
a fixed number of times, namely N/4. This allows us to combine the “i” and
“j” loops into one loop. Optimized implementations will make use of this
fact.
In addition, the Cooley Tukey FFT accesses three twiddle factors per iteration
of the inner loop, as the butterflies that reuse twiddle factors are lumped
together. This leads to accessing the twiddle factor array at three points, each
separated by “ie”. Note that “ie” is initially 1, and is quadrupled with every
iteration. Therefore these three twiddle factors are not even contiguous in the
array.
To vectorize the FFT, it is desirable to access the twiddle factor array using double
word wide loads and fetch the twiddle factors needed. To do this, a modified
twiddle factor array is created, in which the factors WN/4, WN/2, W3N/4 are
arranged to be contiguous. This eliminates the separation between twiddle
factors within a butterfly. However, this implies that we maintain a redundant
version of the twiddle factor array as the loop is traversed from one stage to
another. Hence, the size of the twiddle factor array increases as compared to
the normal Cooley Tukey FFT. The modified twiddle factor array is of size
“2 * N”, whereas that of the conventional Cooley Tukey FFT is of size “3N/4”, where N
is the number of complex points to be transformed. The routine that generates
the modified twiddle factor array was presented earlier. With the above
transformation of the FFT, both the input data and the twiddle factor array can
be accessed using double-word wide loads to enable packed data processing.
The final stage is optimized to remove the multiplication as w0 = 1. This stage
also performs digit reversal on the data, so the final output is in natural order.
In addition, if the number of points to be transformed is a power of 2 but not a
power of 4, the final stage applies a radix-2 pass instead of a radix-4 pass. In
any case, the outputs are returned in normal order.
The code performs the bulk of the computation in place. However, because
digit-reversal cannot be performed in-place, the final result is written to a
separate array, y[].
Function void DSP_fft16x16r(int nx, short * restrict x, short * restrict w, short * restrict y,
int radix, int offset, int nmax)
Description This routine implements a complex forward mixed radix FFT with scaling,
rounding and digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ]
are 16-bit. The output is returned in the separate array y[ ] in normal order.
Each complex value is stored as interleaved 16-bit real and imaginary parts.
The code uses a special ordering of FFT coefficients (also called twiddle
factors).
This redundant set of twiddle factors is size 2*N short samples. As pointed out
in subsequent sections, dividing these twiddle factors by 2 will give an effective
divide by 4 at each stage to guarantee no overflow. The function is accurate
to about 68dB of signal to noise ratio to the DFT function as follows.
DSP_fft16x16r
The function takes the twiddle factors and input data, and calculates the FFT
producing the frequency domain data in the y[ ] array. The FFT allows every
input point to affect every output point, which causes cache thrashing in a
cache-based system. This is mitigated by allowing the main FFT of size N to
be divided into several steps, allowing as much data reuse as possible. For
example, see the following function:
DSP_fft16x16r(N, &x[0], &w[0], y,N/4,0, N)
DSP_fft16x16r(N/4,&x[0], &w[2*3*N/4],y,rad,0, N)
DSP_fft16x16r(N/4,&x[2*N/4], &w[2*3*N/4],y,rad,N/4, N)
DSP_fft16x16r(N/4,&x[2*N/2], &w[2*3*N/4],y,rad,N/2, N)
DSP_fft16x16r(N/4,&x[2*3*N/4],&w[2*3*N/4],y,rad,3*N/4,N)
As discussed previously, N can be either a power of 4 or 2. If N is a power of
4, then rad = 4, and if N is a power of 2 and not a power of 4, then rad = 2. “rad”
controls how many stages of decomposition are performed. It also determines
whether a radix-4 or radix-2 decomposition should be performed at the
last stage. Hence, when “rad” is set to “N/4”, the first stage of the transform
alone is performed and the code exits. To complete the FFT, four other calls
are required to perform N/4 size FFTs. In fact, the ordering of these 4 FFTs
amongst themselves does not matter and, thus, from a cache perspective, it
helps to go through the remaining 4 FFTs in exactly the opposite order to the
first. This is illustrated as follows:
DSP_fft16x16r(N, &x[0], &w[0], y,N/4,0, N)
DSP_fft16x16r(N/4,&x[2*3*N/4],&w[2*3*N/4],y,rad,3*N/4, N)
DSP_fft16x16r(N/4,&x[2*N/2], &w[2*3*N/4],y,rad,N/2, N)
DSP_fft16x16r(N/4,&x[2*N/4], &w[2*3*N/4],y,rad,N/4, N)
DSP_fft16x16r(N/4,&x[0], &w[2*3*N/4],y,rad,0, N)
In addition, this function can be used to minimize call overhead by completing
the FFT with a single function call, as shown below:
DSP_fft16x16r(N, &x[0], &w[0], y, rad, 0, N)
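The rule for choosing rad stated above (4 when N is a power of 4, 2 when N is a power of 2 but not a power of 4) can be sketched as a small helper. The name fft_last_stage_radix is hypothetical, and returning 0 for sizes that are not a power of 2 is a convention chosen for this example only.

```c
/* Hypothetical helper: pick the "rad" argument for an n-point transform.
 * Returns 4 if n is a power of 4, 2 if n is a power of 2 but not of 4,
 * and 0 if n is not a power of 2 (unsupported size). */
int fft_last_stage_radix(int n)
{
    if (n <= 0 || (n & (n - 1)) != 0)
        return 0;                       /* not a power of 2 */
    /* A power of 4 has its single set bit at an even bit position. */
    return (n & 0x55555555) ? 4 : 2;
}
```

For example, N = 1024 gives rad = 4 and N = 512 gives rad = 2.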
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void fft16x16r
(
int n,
short *ptr_x,
short *ptr_w,
short *y,
int radix,
int offset,
int nmax
)
{
int i, l0, l1, l2, h2, predj;
int l1p1,l2p1,h2p1, tw_offset, stride, fft_jmp;
x_h2p1 = x[h2+1];
x_l1 = x[l1];
x_l1p1 = x[l1+1];
x_l2 = x[l2];
x_l2p1 = x[l2+1];
xh0_0 = x0 + x4;
xh1_0 = x1 + x5;
xh0_1 = x2 + x6;
xh1_1 = x3 + x7;
if (radix == 2)
{
xh0_0 = x0;
xh1_0 = x1;
xh0_1 = x2;
xh1_1 = x3;
}
xl0_0 = x4;
xl1_0 = x5;
xl1_1 = x6;
xl0_1 = x7;
}
yt2 = xl0_0 + xl1_1;
yt3 = xl1_0 - xl0_1;
yt6 = xl0_0 - xl1_1;
yt7 = xl1_0 + xl0_1;
if (radix == 2)
{
yt7 = xl1_0 - xl0_1;
yt3 = xl1_0 + xl0_1;
}
y0[k] = yt0; y0[k+1] = yt1;
k += n>>1;
y0[k] = yt2; y0[k+1] = yt3;
k += n>>1;
y0[k] = yt4; y0[k+1] = yt5;
k += n>>1;
y0[k] = yt6; y0[k+1] = yt7;
}
}
Special Requirements
- nx must be a power of 2 or 4.
- All data are in short precision or Q.15 format. The allowed input dynamic range
is $16 - (\log_2(nx) - \lceil \log_4(nx) - 1 \rceil)$ bits.
- The FFT coefficients (twiddle factors) are generated using the function
gen_twiddle_fft16x16 provided in the ’c64plus\dsplib\src\DSP_fft16x16r’
directory. The scale factor must be 32767.5. The input data must be scaled
by $2^{\log_2(nx) - \lceil \log_4(nx) - 1 \rceil}$ to completely prevent overflow.
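The exponent in the scaling requirement can be evaluated with integer arithmetic. The sketch below assumes nx is a power of 2 and uses hypothetical function names; it relies on the identity ceil(log4(nx) - 1) = ceil(log2(nx) / 2) - 1.

```c
/* Hypothetical helper: number of bits by which the input must be scaled,
 * i.e. log2(nx) - ceil(log4(nx) - 1), for nx a power of 2. */
int fft_input_scale_shift(int nx)
{
    int log2n = 0;
    int ceil_log4_minus_1;

    while ((1 << log2n) < nx)
        log2n++;                                  /* log2(nx) */
    ceil_log4_minus_1 = (log2n + 1) / 2 - 1;      /* ceil(log2n / 2) - 1 */
    return log2n - ceil_log4_minus_1;
}

/* Hypothetical helper: allowed input dynamic range in bits,
 * 16 - (log2(nx) - ceil(log4(nx) - 1)). */
int fft_allowed_input_bits(int nx)
{
    return 16 - fft_input_scale_shift(nx);
}
```

For nx = 1024 the shift is 10 - 4 = 6, leaving 10 bits of input dynamic range; for nx = 512 the shift is 9 - 4 = 5, leaving 11 bits.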
Implementation Notes
- The butterfly is bit reversed; i.e. the inner 2 points of the butterfly are
crossed over. This makes the data come out in bit reversed rather than in
radix 4 digit reversed order. This simplifies the last pass of the loop. The
BITR instruction does the bit reversal out of place.
Function void DSP_fft16x32(short * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex forward mixed radix
FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are
32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array
y[ ] in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache. The C code to generate the twiddle factors is provided with
this library in the ’c64plus\dsplib\src\DSP_fft16x32’ directory.
Algorithm For further details, see the source code of the C and Optimized C version of
this function that is provided in the ’c64plus\dsplib\src\DSP_fft16x32’
directory.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- The FFT coefficients (twiddle factors) are generated using the function
gen_twiddle_fft16x32 provided in the ’c64plus\dsplib\src\DSP_fft16x32’
directory. The scale factor must be 32767.5. No scaling is done with the
function; thus the input data must be scaled by 2^(log2(nx) − ceil[log4(nx) − 1])
to completely prevent overflow.
Implementation Notes
Function void DSP_fft32x32(int * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex forward mixed radix
FFT with rounding and digit reversal. Input data x[ ], output data y[ ], and
coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ] in
normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache. The C code to generate the twiddle factors is provided with
this library in the ’c64plus\dsplib\src\DSP_fft32x32’ directory.
Algorithm For further details, see the source code of the C and Optimized C version of
this function that is provided in the ’c64plus\dsplib\src\DSP_fft32x32’
directory.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- The FFT coefficients (twiddle factors) are generated using the function
gen_twiddle_fft32x32 provided in the ’c64plus\dsplib\src\DSP_fft32x32’
directory. The scale factor must be 2147483647.5. No scaling is done with
the function; thus the input data must be scaled by 2^(log2(nx)) to completely
prevent overflow.
Implementation Notes
Function void DSP_fft32x32s(int * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex forward mixed radix
FFT with scaling, rounding and digit reversal. Input data x[ ], output data y[ ],
and coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ]
in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache. The C code to generate the twiddle factors is provided with
this library in the ’c64plus\dsplib\src\DSP_fft32x32s’ directory.
Algorithm For further details, see the source code of the C and Optimized C version of
this function that is provided in the ’c64plus\dsplib\src\DSP_fft32x32s’
directory.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- The FFT coefficients (twiddle factors) are generated using the function
gen_twiddle_fft32x32s provided in the ’c64plus\dsplib\src\DSP_fft32x32s’
directory. The scale factor must be 1073741823.5. No scaling is done with
the function; thus the input data must be scaled by
2^(log2(nx) − ceil[log4(nx) − 1]) to completely prevent overflow.
Implementation Notes
Function void DSP_ifft16x16(short * restrict w, int nx, short * restrict x, short * restrict y)
Description This routine computes a complex inverse mixed radix IFFT with rounding and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved real and imaginary parts. The code uses a
special ordering of IFFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
The fft16x16 routine can also be used to compute an IFFT by conjugating the
input, performing the FFT, and conjugating the output. If the double
conjugation needs to be avoided, use this routine instead: it performs the
IFFT directly and adjusts for the change in the sign of the twiddle factors
internally. Hence, this routine uses the same twiddle factors as the
fft16x16 routine.
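The conjugation identity referred to above, IFFT(X) = conj(FFT(conj(X))) / N, can be illustrated with a small sketch. This is not DSPLIB code: dft4 and idft4_by_conjugation are hypothetical names, the transform is a plain 4-point DFT on interleaved real/imaginary integers, and the 1/N normalization is an assumption of the sketch rather than the scaling behavior of the library routines.

```c
/* Forward 4-point DFT on interleaved re/im integers:
 * X[k] = sum over n of x[n] * (-j)^(n*k). */
void dft4(const int x[8], int X[8])
{
    /* (-j)^m for m = 0..3 is 1, -j, -1, +j */
    static const int wr[4] = { 1,  0, -1, 0 };
    static const int wi[4] = { 0, -1,  0, 1 };

    for (int k = 0; k < 4; k++) {
        int re = 0, im = 0;
        for (int n = 0; n < 4; n++) {
            int m = (n * k) & 3;
            re += x[2*n] * wr[m] - x[2*n + 1] * wi[m];
            im += x[2*n] * wi[m] + x[2*n + 1] * wr[m];
        }
        X[2*k]     = re;
        X[2*k + 1] = im;
    }
}

/* Inverse transform built only from the forward transform:
 * conjugate the input, run the forward DFT, conjugate the output,
 * then divide by N = 4. */
void idft4_by_conjugation(const int X[8], int x[8])
{
    int tmp[8];

    for (int n = 0; n < 4; n++) {
        tmp[2*n]     =  X[2*n];
        tmp[2*n + 1] = -X[2*n + 1];
    }
    dft4(tmp, x);
    for (int n = 0; n < 4; n++) {
        x[2*n]     =  x[2*n] / 4;
        x[2*n + 1] = -x[2*n + 1] / 4;
    }
}
```

Applying dft4 followed by idft4_by_conjugation returns the original samples, at the cost of two extra passes over the data that a dedicated inverse routine avoids.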
Algorithm For further details, see the source code of the C and Optimized C version of
this function that is provided in the ’c64plus\dsplib\src\DSP_ifft16x16’
directory.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- Scaling by two is performed after each radix-4 stage except the last one.
Implementation Notes
DSP_ifft16x16_imre Complex Inverse Mixed Radix 16 x 16-bit FFT With Im/Re Order
Description This routine computes a complex inverse mixed radix IFFT with rounding and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved imaginary and real parts. The code uses a
special ordering of IFFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
The fft16x16_imre routine can also be used to compute an IFFT by conjugating
the input, performing the FFT, and conjugating the output. If the double
conjugation needs to be avoided, use this routine instead: it performs the
IFFT directly and adjusts for the change in the sign of the twiddle factors
internally. Hence, this routine uses the same twiddle factors as the
fft16x16_imre routine.
Algorithm For further details, see the source code of the C and Optimized C version of
this function that is provided in the ’c64plus\dsplib\src\DSP_ifft16x16_imre’
directory.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the imaginary/real
components stored in adjacent locations in the array. The imaginary
components are stored at even array indices, and the real components are
stored at odd array indices.
- Scaling by two is performed after each radix-4 stage except the last one.
Implementation Notes
Function void DSP_ifft16x32(short * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex inverse mixed radix
FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are
32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array
y[ ] in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache.
Algorithm For further details, see the source code of the C and Optimized C version of
this function that is provided in the ’c64plus\dsplib\src\DSP_ifft16x32’
directory.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- The FFT coefficients (twiddle factors) are generated using the function
gen_twiddle_ifft16x32 provided in the ’c64plus\dsplib\src\DSP_ifft16x32’
directory. The scale factor must be 32767.5. No scaling is done with the
function; thus the input data must be scaled by 2^(log2(nx)) to completely
prevent overflow.
Implementation Notes
Function void DSP_ifft32x32(int * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex inverse mixed radix
FFT with rounding and digit reversal. Input data x[ ], output data y[ ], and
coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ] in
normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache.
Algorithm For further details, see the source code of the C and Optimized C version of
this function that is provided in the ’c64plus\dsplib\src\DSP_ifft32x32’
directory.
Special Requirements
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- The FFT coefficients (twiddle factors) are generated using the function
gen_twiddle_ifft32x32 provided in the ’c64plus\dsplib\src\DSP_ifft32x32’
directory. The scale factor must be 2147483647.5. No scaling is done with
the function; thus the input data must be scaled by 2^(log2(nx)) to completely
prevent overflow.
Implementation Notes
Function void DSP_fir_cplx (short * restrict x, short * restrict h, short * restrict r, int nh,
int nr)
Description This function implements the FIR filter for complex input data. The filter has
nr output samples and nh coefficients. Each array consists of an even and odd
term with even terms representing the real part and the odd terms the
imaginary part of the element. The pointer to input array x must point to the
(nh)th complex sample; i.e., element 2*(nh−1), upon entry to the function. The
coefficients are expected in normal order.
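The pointer convention described above is easy to get wrong, so a usage sketch may help. The code below is an illustration only: fir_cplx_ref and fir_cplx_on_buffer are hypothetical names, and the filter body simply follows the loop structure of the natural C code for this function. It assumes the caller's buffer holds nh − 1 complex history samples followed by nr new complex samples.

```c
/* Reference complex FIR on interleaved re/im 16-bit data
 * (hypothetical name, not the DSPLIB symbol). */
void fir_cplx_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    for (int i = 0; i < 2 * nr; i += 2) {
        int real = 0, imag = 0;
        for (int j = 0; j < 2 * nh; j += 2) {
            real += h[j] * x[i - j]     - h[j + 1] * x[i + 1 - j];
            imag += h[j] * x[i + 1 - j] + h[j + 1] * x[i - j];
        }
        r[i]     = (short)(real >> 15);
        r[i + 1] = (short)(imag >> 15);
    }
}

/* buf holds nh - 1 history samples, then nr new samples. The pointer
 * passed to the filter skips the history and points at element
 * 2*(nh-1), so that the negative indices x[i - j] stay inside buf. */
void fir_cplx_on_buffer(const short *buf, const short *h, short *r,
                        int nh, int nr)
{
    fir_cplx_ref(&buf[2 * (nh - 1)], h, r, nh, nr);
}
```

With nh = 2 and a single real tap of 16384 (0.5 in Q.15), each output is half of the corresponding new input sample.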
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_cplx(short *x, short *h, short *r,short nh, short
nr)
{
short i,j;
int imag, real;
for (i = 0; i < 2*nr; i += 2){
imag = 0;
real = 0;
for (j = 0; j < 2*nh; j += 2){
real += h[j] * x[i-j] - h[j+1] * x[i+1-j];
imag += h[j] * x[i+1-j] + h[j+1] * x[i-j];
}
r[i] = (real >> 15);
r[i+1] = (imag >> 15);
}
}
Special Requirements
Implementation Notes
Description This function implements the FIR filter for complex input data. The filter has
nr output samples and nh coefficients. Each array consists of an even and odd
term with even terms representing the real part and the odd terms the
imaginary part of the element. The pointer to input array x must point to the
(nh)th complex sample; i.e., element 2*(nh−1), upon entry to the function. The
coefficients are expected in normal order.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_cplx(short *x, short *h, short *r,short nh, short
nr)
{
short i,j;
int imag, real;
for (i = 0; i < 2*nr; i += 2){
imag = 0;
real = 0;
for (j = 0; j < 2*nh; j += 2){
real += h[j] * x[i-j] - h[j+1] * x[i+1-j];
imag += h[j] * x[i+1-j] + h[j+1] * x[i-j];
}
r[i] = (real >> 15);
r[i+1] = (imag >> 15);
}
}
DSP_fir_cplx_hM4X4
Special Requirements
Implementation Notes
Function void DSP_fir_gen (short * restrict x, short * restrict h, short * restrict r, int nh,
int nr)
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. It operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
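A direct-form FIR with 16-bit data and a 32-bit accumulator can be sketched as follows. This is not the DSPLIB listing: the name fir_real_ref is hypothetical, and both the tap ordering (x[i + j] paired with h[j]) and the final right shift by 15 are assumptions of the sketch.

```c
/* Direct-form real FIR sketch: nr outputs from nh coefficients.
 * Requires nr + nh - 1 input samples in x. */
void fir_real_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    for (int i = 0; i < nr; i++) {
        int sum = 0;                     /* 32-bit accumulator */
        for (int j = 0; j < nh; j++)
            sum += x[i + j] * h[j];      /* 16-bit x 16-bit products */
        r[i] = (short)(sum >> 15);       /* assumed Q.15 output scaling */
    }
}
```

With two taps of 16384 (0.5 in Q.15), each output is the average of two adjacent input samples.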
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
Special Requirements
Implementation Notes
nh Number of coefficients.
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. It operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
DSP_fir_gen_hM17_rA8X8
Special Requirements
Implementation Notes
Function void DSP_fir_r4 (short * restrict x, short * restrict h, short * restrict r, int nh,
int nr)
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_r4(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
Special Requirements
Implementation Notes
Function void DSP_fir_r8 (short * restrict x, short * h, short * restrict r, int nh, int nr)
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
Special Requirements
Implementation Notes
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
DSP_fir_r8_hM16_rM8A8X8
Special Requirements
Implementation Notes
Function void DSP_fir_sym (short * restrict x, short * restrict h, short * restrict r, int nh,
int nr, int s)
Description This function applies a symmetric filter to the input samples. The filter tap array
h[] provides ‘nh+1’ total filter taps. The filter tap at h[nh] forms the center point
of the filter. The taps at h[nh − 1] through h[0] form a symmetric filter about this
central tap. The effective filter length is thus 2*nh+1 taps.
The filter is performed on 16-bit data with 16-bit coefficients, accumulating
intermediate results to 40-bit precision. The accumulator is rounded and
truncated according to the value provided in ‘s’. This allows a variety of
Q-points to be used.
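The symmetric structure described above lets each of the nh outer taps be applied to the sum of two input samples, followed by the center tap h[nh]. The following is a minimal sketch under stated assumptions, not the library code: fir_sym_ref is a hypothetical name, a 64-bit accumulator stands in for the 40-bit one, and s is assumed to be at least 1.

```c
/* Symmetric FIR sketch: 2*nh + 1 effective taps, rounding shift by s.
 * Requires nr + 2*nh input samples in x. */
void fir_sym_ref(const short *x, const short *h, short *r,
                 int nh, int nr, int s)
{
    long long round = 1LL << (s - 1);    /* rounding term, assumes s >= 1 */

    for (int j = 0; j < nr; j++) {
        long long y0 = round;
        for (int i = 0; i < nh; i++)     /* mirrored sample pairs */
            y0 += (long long)(x[j + i] + x[j + 2 * nh - i]) * h[i];
        y0 += (long long)x[j + nh] * h[nh];   /* center tap */
        r[j] = (short)(y0 >> s);
    }
}
```

With nh = 1, h = {8192, 16384} and s = 15, the effective filter is (0.25, 0.5, 0.25).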
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_fir_sym(short *x, short *h, short *r, int nh, int nr,
int s)
{
int i, j;
long y0;
long round = (long) 1 << (s - 1);
for (j = 0; j < nr; j++) {
y0 = round;
Special Requirements
- nr must be a multiple of 4.
Implementation Notes
nh Number of coefficients.
Description This function implements an IIR filter, with a number of biquad stages given
by nh / 4. It accepts a single sample of input and returns a single sample of
output. Coefficients are expected to be in the range [−2.0, 2.0) with Q14
precision.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
short DSP_iir (short Input, const short * Coefs, int
nCoefs, short * State)
{
int x, p0, p1, i, j;
x = (int) Input;
for (i = j = 0; i < nCoefs; i += 4, j += 2) {
p0 = Coefs[i + 2] * State[j] + Coefs[i + 3] * State[j +
1];
p1 = Coefs[i] * State[j] + Coefs[i + 1] * State[j + 1];
State[j + 1] = State[j];
return x;
}
Special Requirements
Implementation Notes
Benchmarks Cycles 4 * nr + 15
Codesize 192 bytes
Function void DSP_iir_lat(short * restrict x, int nx, short * restrict k, int nk, int * restrict
b, short * restrict r)
Description This routine implements a real all-pole IIR filter in lattice structure (AR lattice).
The filter consists of nk lattice stages. Each stage requires one reflection
coefficient k and one delay element b. The routine takes an input vector x[] and
returns the filter output in r[]. Prior to the first call of the routine, the delay
elements in b[] should be set to zero. The input data may have to be pre-scaled
to avoid overflow or achieve better SNR. The reflection coefficients lie in the
range −1.0 < k < 1.0. The order of the coefficients is such that k[nk−1]
corresponds to the first lattice stage after the input and k[0] corresponds to the
last stage.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void iirlat(short *x, int nx, short *k, int nk, int *b,
short *r)
{
int rt; /* output */
int i, j;
{
rt = rt - (short)(b[i] >> 15) * k[i];
b[i + 1] = b[i] + (short)(rt >> 15) * k[i];
}
b[0] = rt;
r[j] = rt >> 15;
}
}
Special Requirements
- nk must be >= 4.
- No special alignment requirements
Implementation Notes
4.5 Math
Function int DSP_dotp_sqr(int G, short * restrict x, short * restrict y, int * restrict r, int nx)
Description This routine performs an nx element dot product of x[ ] and y[ ] and stores it
in r. It also squares each element of y[ ] and accumulates it in G. G is passed
back to the calling function in register A4. This computation of G is used in the
VSELP coder.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
int DSP_dotp_sqr (int G,short *x,short *y,int *r,
int nx)
{
short *y2;
short *endPtr2;
y2 = x;
for (endPtr2 = y2 + nx; y2 < endPtr2; y2++){
*r += *y * *y2;
G += *y * *y;
y++;
}
return(G);
}
Implementation Notes
- Interruptibility: The code is interruptible.
Description This routine takes two vectors and calculates their dot product. The inputs are
16-bit short data and the output is a 32-bit number.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
int DSP_dotprod(short x[ ],short y[ ], int nx)
{
int sum;
int i;
sum = 0;
for(i=0; i<nx; i++){
sum += (x[i] * y[i]);
}
return (sum);
}
Special Requirements
Implementation Notes
Benchmarks Cycles nx / 4 + 19
Codesize 96 bytes
Description This routine finds the element with maximum value in the input vector and
returns that value.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
short DSP_maxval(short x[ ], int nx)
{
int i, max;
max = -32768;
Implementation Notes
- Interruptibility: The code is interruptible.
Benchmarks Cycles nx / 8 + 13
Codesize 125 bytes
DSP_maxidx
Arguments x[nx] Pointer to input vector of size nx. Must be double-word aligned.
Description This routine finds the max value of a vector and returns the index of that value.
The input array is treated as 16 separate columns that are interleaved
throughout the array. If values in different columns are equal to the maximum
value, then the element in the leftmost column is returned. If two values within
a column are equal to the maximum, then the one with the lower index is
returned. Column takes precedence over index.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
int DSP_maxidx(short x[ ], int nx)
{
int max, index = 0, i; /* index stays 0 if no element exceeds max */
max = -32768;
for (i = 0; i < nx; i++)
if (x[i] > max) {
max = x[i];
index = i;
}
return index;
}
Special Requirements
Implementation Notes
- The code is unrolled 16 times to enable the full bandwidth of LDDW and
MAX2 instructions to be utilized. This splits the search into 16 sub-ranges.
The global maximum is then found from the list of maximums of the
sub-ranges. Then, using this offset from the sub-ranges, the global
maximum and its index are found using a simple match. When the maximum
value occurs in more than one sub-range, the returned index may differ from
that of the C code above.
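The tie-breaking rule stated in the Description (column takes precedence over index) can be modelled in plain C. The sketch below is one reading of that rule and not the optimized implementation: maxidx_columns is a hypothetical name, and the column of element i is taken to be i % 16.

```c
/* Index of the maximum, with ties resolved first by column (i % 16,
 * leftmost wins) and then by lower index within a column. */
int maxidx_columns(const short *x, int nx)
{
    int best = 0;

    for (int i = 1; i < nx; i++) {
        if (x[i] > x[best])
            best = i;                               /* strictly larger value */
        else if (x[i] == x[best] && (i % 16) < (best % 16))
            best = i;                               /* equal value, column further left */
    }
    return best;
}
```

If x[5] and x[18] hold the only two maxima of a 32-element array, this sketch returns 18 (column 2), whereas the C code above returns 5.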
Benchmarks Cycles 9 * nx / 64 + 70
Codesize 320 bytes
DSP_minval
Description This routine finds the minimum value of a vector and returns the value.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
short DSP_minval(short x[ ], int nx)
{
int i, min;
min = 32767;
Implementation Notes
- The input data is loaded using double word wide loads, and the MIN2
instruction is used to get to the minimum.
Function void DSP_mul32(int * restrict x, int * restrict y, int * restrict r, short nx)
Arguments x[nx] Pointer to input data vector 1 of size nx. Must be double-word
aligned.
Description The function performs a Q.31 x Q.31 multiply and returns the upper 32 bits of
the result. The results of the intermediate multiplies are accumulated into a
40-bit long register pair, as there could be potential overflow. The contribution
of the multiplication of the two lower 16-bit halves is not considered. The
output is in Q.30 format. Results are accurate to the least significant bit.
Algorithm In the comments below, X and Y are the input values. Xhigh and Xlow
represent the upper and lower 16 bits of X. This is the natural C equivalent of
the optimized intrinsic C code without restrictions. Note that the intrinsic C
code is optimized and restrictions may apply.
void DSP_mul32(const int *x, const int *y, int *r,
short nx)
{
short i;
int a,b,c,d,e;
for(i=nx;i>0;i--)
{
a=*(x++);
b=*(y++);
c=_mpyluhs(a,b); /* Xlow*Yhigh */
d=_mpyhslu(a,b); /* Xhigh*Ylow */
e=_mpyh(a,b); /* Xhigh*Yhigh */
d+=c; /* Xhigh*Ylow+Xlow*Yhigh */
d=d>>16; /* (Xhigh*Ylow+Xlow*Yhigh)>>16 */
e+=d; /* Xhigh*Yhigh + */
/* (Xhigh*Ylow+Xlow*Yhigh)>>16 */
*(r++)=e;
}
}
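The listing above relies on the _mpyluhs, _mpyhslu and _mpyh intrinsics. For checking results on a host machine, the same arithmetic can be written portably. This is a sketch under assumptions: mul32_q31_ref is a hypothetical name, a 64-bit integer stands in for the 40-bit accumulator, and right shifts of negative values are assumed to be arithmetic, which holds on mainstream compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Portable model of one element of the multiply: Xhigh*Yhigh plus the
 * two cross products shifted right by 16. Xlow*Ylow is ignored. */
int32_t mul32_q31_ref(int32_t x, int32_t y)
{
    int32_t xh = x >> 16;                 /* signed upper 16 bits */
    int32_t yh = y >> 16;
    int32_t xl = (int32_t)(x & 0xFFFF);   /* unsigned lower 16 bits */
    int32_t yl = (int32_t)(y & 0xFFFF);

    int64_t cross = (int64_t)xl * yh + (int64_t)xh * yl;

    return (int32_t)((int64_t)xh * yh + (cross >> 16));
}
```

Multiplying 0.5 by 0.5 (40000000h in Q.31) gives 10000000h, which is 0.25 in Q.30.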
Special Requirements
Implementation Notes
Arguments x[nx] Pointer to input data vector 1 of size nx with 32-bit elements.
Must be double-word aligned.
r[nx] Pointer to output data vector of size nx with 32-bit elements.
Must be double-word aligned.
nx Number of elements of input and output vectors. Must be a
multiple of 4 and ≥8.
Description This function negates the elements of a vector (32-bit elements). The input and
output arrays must not be overlapped except for where the input and output
pointers are exactly equal.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_neg32(int *x, int *r, short nx)
{
short i;
for(i=nx; i>0; i--)
*(r++)=-*(x++);
}
Special Requirements
Implementation Notes
- Interruptibility: The code is interruptible.
DSP_recip16
Function void DSP_recip16(short * restrict x, short * restrict rfrac, short * restrict rexp,
short nx)
Description This routine returns the fractional and exponential portion of the reciprocal of
an array x[ ] of Q.15 numbers. The fractional portion rfrac is returned in Q.15
format. Since the reciprocal is always greater than 1, it returns an exponent
such that:
(rfrac[i] * 2^rexp[i]) = true reciprocal
The output is accurate up to the least significant bit of rfrac, but note that this
bit could carry over and change rexp. For a reciprocal of 0, the procedure will
return a fractional part of 7FFFh and an exponent of 16.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_recip16(short *x, short *rfrac, short *rexp, short
nx)
{
int i,j,a,b;
short neg, normal;
for(i=nx; i>0; i--)
{
a=*(x++);
if(a<0) /* take absolute value */
{
a=-a;
neg=1;
}
else neg=0;
normal=_norm(a); /* normalize number */
a=a<<normal;
*(rexp++)=normal-15; /* store exponent */
b=0x80000000; /* dividend = 1 */
for(j=15;j>0;j--)
b=_subc(b,a); /* divide */
b=b&0x7FFF; /* clear remainder */
/* (clear upper half) */
if(neg) b=-b; /* if originally */
/* negative, negate */
*(rfrac++)=b; /* store fraction */
}
}
Implementation Notes
Benchmarks Cycles 9 * nx + 22
Codesize 224 bytes
DSP_vecsumsq
Description This routine returns the sum of squares of the elements contained in the vector
x[ ].
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
int DSP_vecsumsq(short x[ ], int nx)
{
int i, sum=0;
Special Requirements
Implementation Notes
m Weighting factor
Description This routine is used to obtain the weighted vector sum. Both the inputs and
output are 16-bit numbers.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_w_vec(short x[ ],short y[ ],short m,
short r[ ],short nr)
{
short i;
Special Requirements
Implementation Notes
- Interruptibility: The code is interruptible.
- Input is loaded in double-words.
- Use of packed data processing to sustain throughput.
4.6 Matrix
Function void DSP_mat_mul(short * restrict x, int r1, int c1, short * restrict y, int c2, short
* restrict r, int qs)
Description This function computes the expression “r = x * y” for the matrices x and y. The
columnar dimension of x must match the row dimension of y. The resulting
matrix has the same number of rows as x and the same number of columns as
y.
The values stored in the matrices are assumed to be fixed-point or integer
values. All intermediate sums are retained to 32-bit precision, and no overflow
checking is performed. The results are right-shifted by a user-specified
amount, and then truncated to 16 bits.
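The computation described above can be sketched in a few lines. This is an illustration and not the library listing: mat_mul_ref is a hypothetical name, and row-major storage of x, y and r is an assumption of the sketch.

```c
/* r (r1 x c2) = x (r1 x c1) * y (c1 x c2), with each 32-bit sum
 * right-shifted by qs and truncated to 16 bits. Row-major storage
 * is assumed. */
void mat_mul_ref(const short *x, int r1, int c1,
                 const short *y, int c2, short *r, int qs)
{
    for (int i = 0; i < r1; i++) {
        for (int j = 0; j < c2; j++) {
            int sum = 0;                          /* 32-bit accumulator */
            for (int k = 0; k < c1; k++)
                sum += x[i * c1 + k] * y[k * c2 + j];
            r[i * c2 + j] = (short)(sum >> qs);   /* no overflow check */
        }
    }
}
```

For Q.15 operands, qs = 15 keeps the result in Q.15; qs = 0 gives a plain integer product.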
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_mat_mul(short *x, int r1, int c1, short *y, int c2,
short *r, int qs)
{
int i, j, k;
int sum;
/* -------------------------------------------------- */
/* Multiply each row in x by each column in y. The */
Function void DSP_mat_trans(short * restrict x, short rows, short columns,
short * restrict r)
Description This function transposes the input matrix x[ ] and writes the result to matrix r[ ].
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_mat_trans(short *x, short rows, short columns, short
*r)
{
short i,j;
for(i=0; i<columns; i++)
for(j=0; j<rows; j++)
*(r+i*rows+j)=*(x+i+columns*j);
}
Special Requirements
Implementation Notes
- Data from four adjacent rows, spaced “columns” apart are read, and a
local 4x4 transpose is performed in the register file. This leads to four
double words, that are “rows” apart. These loads and stores can cause
bank conflicts; hence, non-aligned loads and stores are used.
4.7 Miscellaneous
return short Return value is the maximum exponent that may be used in
scaling.
Description Computes the exponents (number of extra sign bits) of all values in the input
vector x[ ] and returns the minimum exponent. This will be useful in
determining the maximum shift value that may be used in scaling a block of
data.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
short DSP_bexp(const int *x, short nx)
{
int min_val =_norm(x[0]);
short n;
int i;
for(i=1;i<nx;i++)
{
n =_norm(x[i]); /* _norm(x) = number of */
/* redundant sign bits */
if(n<min_val) min_val=n;
}
return min_val;
}
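The listing above depends on the _norm intrinsic. To run the same logic on a host machine, the intrinsic can be replaced by a portable count of redundant sign bits. This is a sketch with hypothetical names (norm32, bexp_ref), assuming a 32-bit int.

```c
/* Number of redundant sign bits in a 32-bit value: 31 for 0 and -1,
 * 30 for 1, 0 for 40000000h. Portable stand-in for _norm. */
int norm32(int v)
{
    unsigned u = (unsigned)v;
    int n = 0;

    if (v < 0)
        u = ~u;                      /* count leading ones as zeros */
    for (int bit = 30; bit >= 0 && !((u >> bit) & 1u); bit--)
        n++;
    return n;
}

/* Minimum exponent over a block, following the listing above. */
int bexp_ref(const int *x, int nx)
{
    int min_val = norm32(x[0]);

    for (int i = 1; i < nx; i++) {
        int n = norm32(x[i]);
        if (n < min_val)
            min_val = n;
    }
    return min_val;
}
```

The returned value is the largest left shift that can be applied to every element of the block without overflow.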
Special Requirements
- nx must be a multiple of 8.
- The input vector x[ ] must be double-word aligned.
Implementation Notes
- Interruptibility: The code is interruptible.
if (r)
{
_x = (char *)x;
_r = (char *)r;
} else
{
_x = (char *)x;
_r = (char *)x; /* r == NULL: swap in place */
}
DSP_blk_eswap16
Special Requirements
- Input and output arrays do not overlap, except when “r == NULL” so that
the operation occurs in-place.
- The input array and output array are expected to be double-word aligned,
and a multiple of 8 half-words must be processed.
Implementation Notes
- Interruptibility: The code is interruptible.
Benchmarks Cycles nx / 4 + 8
Codesize 192 bytes
Description The data in the x[] array is endian swapped, meaning that the byte-order of the
bytes within each word of the r[] array is reversed. This facilitates moving
big-endian data to a little-endian system or vice-versa.
When the r pointer is non-NULL, the endian-swap occurs out-of-place, similar
to a block move. When the r pointer is NULL, the endian-swap occurs in-place,
allowing the swap to occur without using any additional storage.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_blk_eswap32(void *x, void *r, int nx)
{
    int i;
    char *_x, *_r;
    if (r)
    {
        _x = (char *)x;
        _r = (char *)r;
    } else
    {
        _x = (char *)x;
        _r = (char *)x;    /* r == NULL: swap in place */
    }
    for (i = 0; i < nx; i++)
    {
        char t0, t1, t2, t3;
        t0 = _x[i*4 + 3];
        t1 = _x[i*4 + 2];
        t2 = _x[i*4 + 1];
        t3 = _x[i*4 + 0];
        _r[i*4 + 0] = t0;
        _r[i*4 + 1] = t1;
        _r[i*4 + 2] = t2;
        _r[i*4 + 3] = t3;
    }
}
Special Requirements
- Input and output arrays do not overlap, except where “r == NULL” so that
the operation occurs in-place.
- The input array and output array are expected to be double-word aligned,
and a multiple of 4 words must be processed.
Implementation Notes
- Interruptibility: The code is interruptible.
Benchmarks Cycles nx / 2 + 11
Codesize 224 bytes
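The effect of the swap on a single word can also be expressed with shifts and masks, which may be easier to check by hand than the byte-wise loop. The sketch below is a host-side illustration; swap32_word and swap32_block are hypothetical names and are not part of DSPLIB.

```c
#include <stdint.h>

/* Hypothetical helper: byte-reverse one 32-bit word. */
static uint32_t swap32_word(uint32_t w)
{
    return (w >> 24)
         | ((w >> 8) & 0x0000FF00u)
         | ((w << 8) & 0x00FF0000u)
         | (w << 24);
}

/* Swap nx words from x into r, or in place when r is NULL. */
static void swap32_block(uint32_t *x, uint32_t *r, int nx)
{
    uint32_t *dst = r ? r : x;
    int i;

    for (i = 0; i < nx; i++)
        dst[i] = swap32_word(x[i]);
}
```

For example, swap32_word(0x11223344) yields 0x44332211, and applying the swap twice restores the original word.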
Description The data in the x[] array is endian swapped, meaning that the byte-order of the
bytes within each double-word of the r[] array is reversed. This facilitates
moving big-endian data to a little-endian system or vice-versa.
When the r pointer is non-NULL, the endian-swap occurs out-of-place, similar
to a block move. When the r pointer is NULL, the endian-swap occurs in-place,
allowing the swap to occur without using any additional storage.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_blk_eswap64(void *x, void *r, int nx)
{
    int i;
    char *_x, *_r;
    if (r)
    {
        _x = (char *)x;
        _r = (char *)r;
    } else
    {
        _x = (char *)x;
        _r = (char *)x;    /* r == NULL: swap in place */
    }
    for (i = 0; i < nx; i++)
    {
        char t0, t1, t2, t3, t4, t5, t6, t7;
        t0 = _x[i*8 + 7];
        t1 = _x[i*8 + 6];
        t2 = _x[i*8 + 5];
        t3 = _x[i*8 + 4];
        t4 = _x[i*8 + 3];
        t5 = _x[i*8 + 2];
        t6 = _x[i*8 + 1];
        t7 = _x[i*8 + 0];
        _r[i*8 + 0] = t0;
        _r[i*8 + 1] = t1;
        _r[i*8 + 2] = t2;
        _r[i*8 + 3] = t3;
        _r[i*8 + 4] = t4;
        _r[i*8 + 5] = t5;
        _r[i*8 + 6] = t6;
        _r[i*8 + 7] = t7;
    }
}
Special Requirements
- Input and output arrays do not overlap, except when “r == NULL” so that
the operation occurs in-place.
- The input array and output array are expected to be double-word aligned,
and a multiple of 2 double-words must be processed.
Implementation Notes
- Interruptibility: The code is interruptible.
Benchmarks Cycles nx + 11
Codesize 224 bytes
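The Benchmarks entries for the three endian-swap routines can be compared for a buffer of fixed size. The helpers below simply evaluate the published formulas; the function names are hypothetical, and the sketch assumes that nx counts elements of each routine's own width (half-words, words, or double-words).

```c
/* Cycle estimates taken from the Benchmarks entries for each routine.
 * nx is the number of elements of that routine's element width. */
static int eswap16_cycles(int nx) { return nx / 4 + 8; }
static int eswap32_cycles(int nx) { return nx / 2 + 11; }
static int eswap64_cycles(int nx) { return nx + 11; }
```

For a 1024-byte buffer this gives 136 cycles for 512 half-words, 139 cycles for 256 words, and 139 cycles for 128 double-words. Under these formulas all three routines process roughly eight bytes per cycle, so the element width mainly affects the fixed overhead.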
Description This routine moves nx 16-bit elements from one memory location pointed to
by x to another pointed to by r. The source and destination blocks can be
overlapped.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_blk_move(short *x, short *r, int nx)
{
    int i;
    if (r < x)
    {
        for (i = 0; i < nx; i++)
            r[i] = x[i];
    } else
    {
        for (i = nx - 1; i >= 0; i--)
            r[i] = x[i];
    }
}
Special Requirements
Implementation Notes
Benchmarks Cycles nx / 4 + 6
Codesize 64 bytes
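The pointer comparison in the algorithm is what allows the source and destination blocks to overlap. The sketch below restates the natural-C logic under a hypothetical name, blk_move_sketch, so that the overlapping case can be tried on a host compiler; it is not the optimized library routine.

```c
/* Hypothetical restatement of the natural-C block move. The copy
 * direction is chosen so that source elements are read before they
 * can be overwritten. */
void blk_move_sketch(const short *x, short *r, int nx)
{
    int i;

    if (r < x) {
        /* Destination starts below the source: copy from the front. */
        for (i = 0; i < nx; i++)
            r[i] = x[i];
    } else {
        /* Destination starts at or above the source: copy from the back. */
        for (i = nx - 1; i >= 0; i--)
            r[i] = x[i];
    }
}
```

With buf = {1, 2, 3, 4, 5, 6}, the call blk_move_sketch(buf, buf + 2, 4) leaves buf = {1, 2, 1, 2, 3, 4}. A front-to-back copy in that case would overwrite elements 3 and 4 before they were read.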
DSP_fltoq15
Arguments x[nx] Pointer to floating-point input vector of size nx. x should contain
the numbers normalized between [−1,1).
Description Convert the IEEE floating point numbers stored in vector x[ ] into Q.15 format
numbers stored in vector r[ ]. Results are truncated toward zero. Values that
exceed the size limit will be saturated to 0x7fff if value is positive and 0x8000
if value is negative. All values too small to be correctly represented will be
truncated to 0.
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void fltoq15(float x[], short r[], short nx)
{
    int i, a;
    for (i = 0; i < nx; i++)
    {
        a = 32768 * x[i];    /* scale to Q.15; truncates toward zero */
        /* saturate to 16-bit */
        if (a > 32767) a = 32767;
        if (a < -32768) a = -32768;
        r[i] = (short) a;
    }
}
Implementation Notes
Benchmarks Cycles 2 * nx + 10
Codesize 192 bytes
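The truncation and saturation rules in the Description can be checked on individual values. The helper below is a hypothetical single-value version written for a host compiler; it assumes the input magnitude is small enough that the scaled product fits in an int.

```c
/* Hypothetical single-value conversion: scale by 2^15, truncate toward
 * zero, then saturate to the signed 16-bit range. */
static short float_to_q15(float v)
{
    int a = (int)(32768.0f * v);   /* float-to-int conversion truncates */

    if (a > 32767)  a = 32767;
    if (a < -32768) a = -32768;
    return (short)a;
}
```

Under this sketch 0.5 converts to 16384, 1.0 saturates to 32767 (0x7fff), -1.0 converts to -32768 (0x8000), and 0.00001 is truncated to 0.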
DSP_minerror
Function int minerror (short * restrict GSP0_TABLE, short * restrict errCoefs, int * restrict
max_index)
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
int minerr
(
    const short *restrict GSP0_TABLE,
    const short *restrict errCoefs,
    int *restrict max_index
)
{
    int val, maxVal = -50;
    int i, j;
    for (i = 0; i < GSP0_NUM; i++)
    {
        for (val = 0, j = 0; j < GSP0_TERMS; j++)
            val += GSP0_TABLE[i*GSP0_TERMS+j] * errCoefs[j];
Implementation Notes
Benchmarks Cycles 2 * nx + 10
Codesize 1120 bytes
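The algorithm listing for this routine stops after the inner dot-product loop in this copy. The sketch below shows one way such a search might be completed. Everything after the inner loop is an assumption, as are the values chosen for GSP0_TERMS and GSP0_NUM and the convention of returning the table offset i*GSP0_TERMS through max_index; the name minerr_sketch is hypothetical.

```c
#define GSP0_TERMS 9     /* assumed number of terms per table entry */
#define GSP0_NUM   256   /* assumed number of table entries */

/* Hypothetical completion: find the table entry whose dot product with
 * errCoefs is largest. Returns that dot product and stores the entry's
 * offset into the table through max_index. */
int minerr_sketch(const short *GSP0_TABLE, const short *errCoefs,
                  int *max_index)
{
    int val, maxVal = -50;
    int i, j;

    for (i = 0; i < GSP0_NUM; i++) {
        for (val = 0, j = 0; j < GSP0_TERMS; j++)
            val += GSP0_TABLE[i*GSP0_TERMS + j] * errCoefs[j];

        if (val > maxVal) {            /* assumed: keep the first maximum */
            maxVal = val;
            *max_index = i * GSP0_TERMS;
        }
    }
    return maxVal;
}
```

Because maxVal starts at -50, this sketch leaves max_index unmodified if every dot product is -50 or less; callers of a routine like this may want to initialize the index beforehand.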
DSP_q15tofl
Description Converts the values stored in vector x[ ] in Q.15 format to IEEE floating point
numbers in output vector r[ ].
Algorithm This is the natural C equivalent of the optimized intrinsic C code without
restrictions. Note that the intrinsic C code is optimized and restrictions may
apply.
void DSP_q15tofl(short *x, float *r, int nx)
{
    int i;
    for (i = 0; i < nx; i++)
        r[i] = (float) x[i] / 0x8000;
}
Implementation Notes
- Interruptibility: The code is interruptible.
- Loop is unrolled twice
Benchmarks Cycles 9 * nx / 4 + 12
Codesize 704 bytes
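A few representative values make the scaling concrete. The helper below is a hypothetical single-value version of the conversion for use on a host compiler; it divides by 0x8000 (32768) exactly as the algorithm does.

```c
/* Hypothetical single-value conversion from Q.15 to floating point. */
static float q15_to_float(short v)
{
    return (float)v / 0x8000;
}
```

Under this sketch 16384 (0x4000) converts to 0.5, -32768 (0x8000) converts to -1.0, and 32767 (0x7fff) converts to 32767/32768, just below 1.0. Every Q.15 value is exactly representable in single precision, so the conversion in this direction loses no information.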
This appendix describes performance considerations related to the C64x+ DSPLIB and provides
information about the Q format used by DSPLIB functions.
Performance Considerations
Fractional Q Formats
Bit     15   14   13   12   …   3    2    1    0
Value   S    Q30  Q29  Q28  …   Q19  Q18  Q17  Q16
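The bit fields above correspond to the upper 16 bits of a 32-bit Q.31 value. The sketch below shows how the two 16-bit halves might be extracted and how the whole value maps to a real number. The function names are hypothetical, and the sketch makes no assumption about which half is stored at the lower memory address.

```c
#include <stdint.h>

/* Upper 16 bits of a Q.31 value: sign bit S and fraction bits Q30..Q16. */
static uint16_t q31_high_half(int32_t q)
{
    return (uint16_t)((uint32_t)q >> 16);
}

/* Lower 16 bits of a Q.31 value: fraction bits Q15..Q0. */
static uint16_t q31_low_half(int32_t q)
{
    return (uint16_t)((uint32_t)q & 0xFFFFu);
}

/* Interpret a Q.31 value as a real number in the range [-1, 1). */
static double q31_to_double(int32_t q)
{
    return (double)q / 2147483648.0;   /* divide by 2^31 */
}
```

For example, the Q.31 value 0x40000000 has a high half of 0x4000, a low half of 0x0000, and represents 0.5.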
Appendix B
This appendix provides information about software updates and customer support.
DSPLIB Software Updates
DSPLIB Customer Support
Appendix C
Glossary
A
address: The location of program code or data stored; an individually
accessible memory location.
assert: To make a digital logic device pin active. If the pin is active low, then
a low voltage on the pin asserts it. If the pin is active high, then a high
voltage asserts it.
B
bit: A binary digit, either a 0 or 1.
big endian: An addressing protocol in which bytes are numbered from left
to right within a word. More significant bytes in a word have lower
numbered addresses. Endian ordering is specific to hardware and is
determined at reset. See also little endian.
block: The three least significant bits of the program address. These
correspond to the address within a fetch packet of the first instruction
being addressed.
C
cache: A fast storage buffer in the central processing unit of a computer.
cache controller: System component that coordinates program accesses
between CPU program fetch mechanism, cache, and external memory.
CCS: Code Composer Studio.
central processing unit (CPU): The portion of the processor involved in
arithmetic, shifting, and Boolean logic operations, as well as the
generation of data- and program-memory addresses. The CPU includes
the central arithmetic logic unit (CALU), the multiplier, and the auxiliary
register arithmetic unit (ARAU).
chip support library (CSL): The CSL is a set of application programming
interfaces (APIs) consisting of target side DSP code used to configure
and control all on-chip peripherals.
clock cycle: A periodic sequence of events based on the input from the
external clock.
clock modes: Options used by the clock generator to change the internal
CPU clock frequency to a fraction or multiple of the frequency of the input
clock signal.
code: A set of instructions written to perform a task; a computer program or
part of a program.
coder-decoder or compression/decompression (codec): A device that
codes in one direction of transmission and decodes in another direction
of transmission.
compiler: A computer program that translates programs in a high-level
language into their assembly-language equivalents.
control register: A register that contains bit fields that define the way a
device operates.
D
device ID: Configuration register that identifies each peripheral component
interconnect (PCI).
DMA source: The module where the DMA data originates. DMA data is read
from the DMA source.
DMA transfer: The process of transferring data from one part of memory to
another. Each DMA transfer consists of a read bus cycle (source to DMA
holding register) and a write bus cycle (DMA holding register to
destination).
DSP_autocor: Autocorrelation.
DSP_fft16x32: Complex forward mixed radix 16- x 32-bit FFT with rounding.
DSP_fft32x32: Complex forward mixed radix 32- x 32-bit FFT with rounding.
DSP_fft32x32s: Complex forward mixed radix 32- x 32-bit FFT with scaling.
E
evaluation module (EVM): Board and software tools that allow the user to
evaluate a specific device.
external interrupt: A hardware interrupt triggered by a specific value on a
pin.
external memory interface (EMIF): Microprocessor hardware that is used
to read from and write to off-chip memory.
F
fast Fourier transform (FFT): An efficient method of computing the discrete
Fourier transform algorithm, which transforms functions between the
time domain and the frequency domain.
fetch packet: A contiguous 8-word series of instructions fetched by the CPU
and aligned on an 8-word boundary.
FFT: See fast Fourier transform (FFT).
flag: A binary status indicator whose state indicates whether a particular
condition has occurred or is in effect.
frame: An 8-word space in the cache RAMs. Each fetch packet in the cache
resides in only one frame. A cache update loads a frame with the
requested fetch packet. The cache contains 512 frames.
G
global interrupt enable bit (GIE): A bit in the control status register (CSR)
that is used to enable or disable maskable interrupts.
H
HAL: Hardware abstraction layer of the CSL. The HAL underlies the service
layer and provides it a set of macros and constants for manipulating the
peripheral registers at the lowest level. It is a low-level symbolic interface
into the hardware providing symbols that describe peripheral
registers/bitfields, and macros for manipulating them.
host: A device to which other devices (peripherals) are connected and that
generally controls those devices.
host port interface (HPI): A parallel interface that the CPU uses to
communicate with a host processor.
I
index: A relative offset in the program address that specifies which frame is
used out of the 512 frames in the cache into which the current access is
mapped.
L
least significant bit (LSB): The lowest-order bit in a word.
linker: A software tool that combines object files to form an object module,
which can be loaded into memory and executed.
little endian: An addressing protocol in which bytes are numbered from right
to left within a word. More significant bytes in a word have
higher-numbered addresses. Endian ordering is specific to hardware
and is determined at reset. See also big endian.
M
maskable interrupt: A hardware interrupt that can be enabled or disabled
through software.
N
nonmaskable interrupt (NMI): An interrupt that can be neither masked nor
disabled.
O
object file: A file that has been assembled or linked and contains machine
language object code.
P
peripheral: A device connected to and usually controlled by a host device.
R
random-access memory (RAM): A type of memory device in which the
individual locations can be accessed in any order.
reset: A means of bringing the CPU to a known state by setting the registers
and control bits to predetermined values and signaling execution to start
at a specified address.
S
service layer: The top layer of the 2-layer chip support library architecture
providing high-level APIs into the CSL and BSL. The service layer is
where the actual APIs are defined and is the interface layer.
system software: The blanket term used to denote collectively the chip
support libraries and board support libraries.
T
tag: The 18 most significant bits of the program address. This value
corresponds to the physical address of the fetch packet that is in that
frame.
TIMER module: TIMER is an API module used for configuring the timer
registers.
W
word: A multiple of eight bits that is operated upon as a unit. For the C6x,
a word is 32 bits in length.
Mailing Address: Texas Instruments, Post Office Box 655303, Dallas, Texas 75265
Copyright © 2008, Texas Instruments Incorporated