Lab_Notes_6
Contents
1 Introduction
2 The Discrete Fourier Transform (DFT)
3 A Fast Fourier Transform (FFT) Algorithm
  3.1 An 8-point Decimation-in-Time FFT algorithm
4 Implementing An FFT Algorithm on the C6713 DSK
  4.1 The Bit-Reversing Algorithm*
  4.2 Implementing the Factored Butterfly*
5 FIR Filtering using an FFT Algorithm
  5.1 Real-Time Block Processing on the C6713 DSK*
  5.2 Linear Convolution using the DFT
  5.3 Overlap and Add Block Convolution FIR Filtering*
  5.4 Implementing the Overlap and Add Algorithm using an FFT*
6 End Notes
  6.1 Advanced Lab Topics
1 Introduction
The discrete Fourier transform (DFT) is the only Fourier transform that can be computed
exactly on a digital computer, if by exact we mean ignoring finite precision effects. The DFT, when
implemented directly, requires many complex multiplications and additions. When N is highly
composite (e.g. N is a power of 2), the number of complex multiplications and additions in an
N-point DFT may be reduced significantly by efficiently factoring the DFT. These factored
DFT algorithms are collectively known as fast Fourier transform (FFT) algorithms. The FFT
algorithm was introduced to signal processing in 1965 in a paper published by J. W. Cooley
and J. W. Tukey [3]. However, the earliest known discovery of this factorization is credited to
C. F. Gauss in 1805 [5], and by 1965 there already existed some efficient codes that were nearly
equivalent to the FFT.
In this lab, you will study
• the DFT,
• the in-place decimation-in-time FFT on the DSK using a C language program, and
• FIR filter implementation using the FFT for fast convolution.
2 The Discrete Fourier Transform (DFT)
The discrete Fourier transform, or DFT, is a one-to-one mapping between an $N$-point complex
discrete-time sequence $\{x[n],\ n = 0, 1, \ldots, N-1\}$ and an $N$-point complex Fourier series
sequence $\{X[m],\ m = 0, 1, \ldots, N-1\}$. The resulting Fourier transform pair is denoted
$\{x[n],\ n t_o\}_N \longleftrightarrow \{X[m],\ m\frac{\omega_o}{N}\}_N$, where
\[ X[m] = \sum_{n=0}^{N-1} x[n]\, W_N^{-mn} \quad \text{(analysis)} \tag{1} \]
\[ x[n] = \frac{1}{N} \sum_{m=0}^{N-1} X[m]\, W_N^{mn} \quad \text{(synthesis)}, \tag{2} \]
where $W_N = e^{j 2\pi/N}$ and the $W_N^{mn}$ are the $N$ roots of unity on the unit circle in the
complex plane. Here, the signal samples are located at times $n t_o$ seconds, and the FS
coefficients, $X[m]$, are located at frequencies $m\omega_o/N$ radians per second. When no confusion
will result, we will
Symmetry property: \[ W_N^{k+N/2} = -W_N^{k} \tag{3} \]
Periodicity property: \[ W_N^{k+N} = W_N^{k}. \tag{4} \]
When $N$ is a power of two (i.e. $N = 2^{\nu}$, where $\nu = \log_2 N$ is an integer), a radix-2 FFT
algorithm may be used to efficiently compute a DFT. There are two ways of implementing a radix-2
FFT, namely decimation-in-time and decimation-in-frequency. Since these two algorithms are
transposes of each other, only the decimation-in-time algorithm will be derived.
To begin this factorization, the DFT sum in eqn (1) is split into a sum of two length-$N/2$ sums,
one on the even indices of $x[n]$ and one on the odd indices. This process is illustrated below [6,
pg. 635–636]:
Footnote 1: The complex multiplication of two numbers $z_1 = x_1 + jy_1$ and $z_2 = x_2 + jy_2$ is
$z_1 z_2 = (x_1 x_2 - y_1 y_2) + j(x_1 y_2 + x_2 y_1)$, which requires four real multiplications and two
real additions. The addition of the two complex numbers $z_1$ and $z_2$ is
$z_1 + z_2 = (x_1 + x_2) + j(y_1 + y_2)$, which requires two real additions. Note that complex
arithmetic requires "two channel" variables that hold the real and imaginary parts of the complex number.
\[ X[m] = \sum_{n=0}^{N/2-1} x[2n]\, W_N^{-2mn} + \sum_{n=0}^{N/2-1} x[2n+1]\, W_N^{-m(2n+1)} \tag{5} \]
\[ \phantom{X[m]} = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{-mn} + W_N^{-m} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{-mn} \tag{6} \]
\[ \phantom{X[m]} = G[m] + W_N^{-m} H[m], \quad m = 0, 1, \ldots, N-1. \tag{7} \]
Equation (6) uses the fact that $W_N^2 = e^{(j 2\pi/N) 2} = e^{j 2\pi/(N/2)} = W_{N/2}$ to show how an
$N$-point DFT may be split into the sum of two $N/2$-point DFTs, namely $G[m]$ and $H[m]$ in eqn (7).
Now, these $N/2$-point DFTs are periodic in $m$ with period $N/2$ (i.e. $G[m] = G[m + rN/2]$ and
$H[m] = H[m + rN/2]$ for all integers $r$). This process of grouping the even and odd indexed terms
of $x[n]$ into two $N/2$-point DFTs is known as time decimation. As a result of this factorization,
the number of complex multiplications has been reduced from $N^2$ to
$(N/2)^2 + (N/2)^2 + N = N^2/2 + N$, where the extra $N$ multiplications are for multiplying the
coefficients $H[m]$ by $W_N^{-m}$, $m = 0, 1, \ldots, N-1$. This factorization has reduced the number
of complex multiplications roughly in half. This process may be repeated until there are $\log_2 N$
stages, each made up of $N/2$ 2-point FFT factored butterflies. The final factorization uses the
symmetry property to cut the number of multiplications required in half. The discussion in the
following paragraphs clarifies this point.
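The multiplication counts quoted above can be tabulated with a few lines of C. This is only a counting sketch under the assumptions of the paragraph above (every twiddle multiplication is counted, and the symmetry property has not yet been applied); the function names are hypothetical. Splitting all the way down satisfies C(N) = 2 C(N/2) + N with C(1) = 0, which evaluates to N log2 N.

```c
#include <assert.h>

/* Complex multiplications for a direct N-point DFT: N^2. */
long dft_mults(long n)
{
    return n * n;
}

/* One decimation-in-time split: two direct N/2-point DFTs plus the
   N multiplications of H[m] by W_N^(-m), i.e. N^2/2 + N. */
long one_split_mults(long n)
{
    return 2 * dft_mults(n / 2) + n;
}

/* Splitting repeatedly until only 1-point DFTs remain:
   C(N) = 2*C(N/2) + N, C(1) = 0, which evaluates to N*log2(N). */
long full_split_mults(long n)
{
    assert(n >= 1 && (n & (n - 1)) == 0);   /* N must be a power of two */
    if (n == 1)
        return 0;
    return 2 * full_split_mults(n / 2) + n;
}
```

For N = 8 these functions return 64, 40, and 24 respectively; the symmetry property later halves the last figure.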
3.1 An 8-point Decimation-in-Time FFT algorithm
To begin factoring an 8-point DFT sum, we group the even and odd terms of the sequence
$\{x[n]\}_8$ and perform two 4-point DFTs as follows [6]:
\[ X[m] = \sum_{n=0}^{7} x[n]\, W_8^{-mn} \tag{8} \]
\[ \phantom{X[m]} = \sum_{n=0}^{3} x[2n]\, W_4^{-mn} + W_8^{-m} \sum_{n=0}^{3} x[2n+1]\, W_4^{-mn} \tag{9} \]
\[ \phantom{X[m]} = G[m] + W_8^{-m} H[m], \quad m = 0, 1, \ldots, 7. \tag{10} \]
Here, $G[m] = \sum_{n=0}^{3} x[2n]\, W_4^{-mn}$ is the result of a 4-point DFT on the even indices of
$\{x[n]\}_8$, and $H[m] = \sum_{n=0}^{3} x[2n+1]\, W_4^{-mn}$ is the result of a 4-point DFT on the odd
indices of $\{x[n]\}_8$. This is called decimation-in-time. We may draw the signal flow graph for
this once-factored DFT as shown in Figure 1.
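[Figure 1: Signal flow graph of the once-factored 8-point DFT. The even-indexed inputs x[0], x[2],
x[4], x[6] feed one 4-point DFT with outputs G[0], ..., G[3]; the odd-indexed inputs x[1], x[3],
x[5], x[7] feed a second 4-point DFT with outputs H[0], ..., H[3]. The branches into the outputs
X[0], ..., X[7] carry the factors labelled W_8^0 through W_8^7.]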
\[ X[m] = \left( \sum_{n=0}^{1} x[4n]\, W_2^{-mn} + W_4^{-m} \sum_{n=0}^{1} x[4n+2]\, W_2^{-mn} \right)
   + W_8^{-m} \left( \sum_{n=0}^{1} x[4n+1]\, W_2^{-mn} + W_4^{-m} \sum_{n=0}^{1} x[4n+3]\, W_2^{-mn} \right), \tag{11} \]
where the four summations are each 2-point sub-DFTs on indices separated by four. Each sum
inside the parentheses is a 4-point sub-DFT that amounts to a recombination of two 2-point
sub-DFTs (one on the even indices and one on the odd indices). The sum of the two 4-point
sub-DFTs is an 8-point DFT that amounts to a recombination of the two 4-point sub-DFTs.
Using the facts that $W_2^k = W_8^{4k}$ and $W_4^k = W_8^{2k}$, along with the fact that $W_8^{4k}$ is
periodic in $k$ with period 2 and $W_8^{2k}$ is periodic in $k$ with period 4, yields the final radix-2
FFT factorization of an 8-point DFT (see footnote 2). This is shown in Figure 2.
Footnote 2: To further clarify this, note that in eqn (11) when $n = 0$ in each of the 2-point sums
(i.e. sums involving the time samples x[0], x[1], x[2], and x[3]), $W_2^{-mn} = W_2^{0} = 1$, so these
time samples are not scaled in the first stage. When $n = 1$ in each of the 2-point sums (i.e. sums
involving the time samples x[4], x[5], x[6], and x[7]), $W_2^{-mn} = W_2^{-m}$, which is equal to
$W_2^{0}$ when $m$ is even and $W_2^{-1}$ when $m$ is odd. Bearing in mind that $W_2^{0} = W_8^{0}$
and $W_2^{-1} = W_8^{-4}$, use these facts to justify the first stage of Figure 2.
[Figure 2: Signal flow graph of the fully factored radix-2 decimation-in-time 8-point FFT. The
inputs appear in the order x[0], x[4], x[2], x[6], x[1], x[5], x[3], x[7], the outputs in the
order X[0], ..., X[7], and the branches of the three stages carry powers of W_8.]
By inspecting Figure 2, there are $8 \log_2 8 = 24$ complex multiplications ($W_8^{-m}$ factors). The
direct calculation of an 8-point DFT would require $8^2 = 64$ complex multiplications. This FFT
factorization has reduced the number of complex multiplications by 40, which means that, counting
complex multiplications only, this algorithm will run roughly $64/24 \approx 2.67$ times faster than
a direct 8-point DFT computation.
When we implement the FFT algorithm shown in Figure 2, we will over-write the set of N
registers from the previous stage. This is known as an in-place algorithm. Implementing an
in-place algorithm will use the minimum number of memory locations, but it will not save the
N input samples. Before we implement an in-place algorithm, we utilize a symmetry condition,
which will further cut the number of complex multiplications in half. Notice that at each stage,
a set of two nodes from one stage is used to compute the same two nodes in the next stage.
Each of these mappings is of the form shown in Figure 3.
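[Figure 3 graphic: two nodes of the (l-1)-th stage map to the same two nodes of the l-th stage.
The branch into the top node is scaled by $W_N^{-k}$ and the branch into the bottom node by
$W_N^{-(k+N/2)} = -W_N^{-k}$.]
Figure 3: The basic butterfly structure in a radix-2 decimation-in-time FFT algorithm. Adapted
from [6].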
From the symmetry property in eqn (3), $W_N^{-(k+N/2)} = -W_N^{-k}$. Now, the value at the bottom
node in the $l$-th stage may also be calculated by taking the value at the top node from the
$(l-1)$-th stage and subtracting the value of the bottom node of the $(l-1)$-th stage scaled by the
complex value $W_N^{-k}$. This observation leads to the in-place butterfly structure shown in
Figure 4.
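[Figure 4: The in-place butterfly. The nodes $a$ (top) and $b$ (bottom) of the (l-1)-th stage are
replaced by $a' = a + W_N^{-k} b$ and $b' = a - W_N^{-k} b$ in the l-th stage; the bottom branch is
scaled once by $W_N^{-k}$ and enters the bottom node with a factor of -1.]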
When this butterfly structure is implemented in an in-place algorithm, the value in the bottom
node of the $(l-1)$-th stage is first copied to a temporary variable, which is scaled by a twiddle
factor. This happens before either node has been over-written. Then, the bottom node in
the $l$-th stage of the butterfly is computed and the value over-writes the bottom node. After
the bottom node in the $l$-th stage is calculated in-place, the value in the temporary variable
is used to calculate the value of the top node in the $l$-th stage, which over-writes the top
node. This extra memory requirement is unavoidable in an in-place implementation. However,
in assembly code, this temporary variable would be stored in a core register and not saved to a
memory location. The final in-place decimation-in-time signal flow graph for an 8-point FFT
algorithm is shown in Figure 5. Notice that only 12 complex multiplications are required in
this 8-point FFT algorithm. Also, only the first $N/2$ roots of unity are required for this
implementation. These roots of unity, $W_N^k$ for $k = 0, 1, \ldots, (N/2) - 1$, are known as the
twiddle factors. In this FFT algorithm, the complex conjugates of the twiddle factors are needed,
a point to be clarified later. In general, an $N$-point in-place decimation-in-time FFT algorithm
will require $\frac{N}{2} \log_2 N$ complex multiplications and $N \log_2 N$ complex additions.
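The order of operations described above (scale the bottom node into a temporary, over-write the bottom node, then over-write the top node) can be sketched for a single butterfly as follows. The helper name butterfly_in_place and the float layout of COMPLEX are assumptions made for illustration; W is taken to be a stored twiddle factor $W_N^k$, so its conjugate supplies the $W_N^{-k}$ scaling.

```c
#include <assert.h>

/* Assumed layout of the COMPLEX structure (float members). */
typedef struct { float real; float imag; } COMPLEX;

/* In-place radix-2 butterfly of Figure 4:
     a' = a + conj(W) * b,   b' = a - conj(W) * b,
   where W = W_N^k is a stored twiddle factor and conj(W) = W_N^(-k). */
void butterfly_in_place(COMPLEX *a, COMPLEX *b, COMPLEX W)
{
    COMPLEX temp;

    assert(a && b);

    /* Scale the bottom node once, before either node is over-written.
       (b.real + j b.imag)(W.real - j W.imag): 4 real mults, 2 real adds. */
    temp.real = b->real * W.real + b->imag * W.imag;
    temp.imag = b->imag * W.real - b->real * W.imag;

    /* Bottom node first, using the still-unmodified top node. */
    b->real = a->real - temp.real;
    b->imag = a->imag - temp.imag;

    /* Top node last, re-using the scaled value held in temp. */
    a->real = a->real + temp.real;
    a->imag = a->imag + temp.imag;
}
```

With a = 1 + j2, b = 3 + j4 and W = j, the scaled bottom value is 4 - j3, so the call leaves a = 5 - j1 and b = -3 + j5.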
[Figure 5 graphic: inputs in the order x[0], x[4], x[2], x[6], x[1], x[5], x[3], x[7]; outputs
X[0], ..., X[7]. The first stage uses the factor labelled W_8^0, the second stage W_8^0 and
W_8^2, and the third stage W_8^0, W_8^1, W_8^2, and W_8^3; every butterfly has a -1 on its
lower branch.]
Figure 5: The in-place decimation-in-time signal flow graph for an 8-point FFT algorithm.
Adapted from [6].
The trick in the decimation-in-time algorithm is to organize the N input samples in bit-reversed
order. By this, we mean that we represent the original indices as $(\log_2 N)$-bit binary numbers
in natural (ascending) order, and then reverse the bits to create the bit-reversed order. In the
case N = 8, the 3-bit bit-reversed order observed in Figure 5 is illustrated in Table 1.
This re-arrangement can be seen in the arrangement of the inputs in the FFT flow graph in
Figure 2.
Table 1: Bit-reversed ordering of the indices for N = 8.
n (dec)   n (bin)   bit-reversed (bin)   bit-reversed (dec)
0         000       000                  0
1         001       100                  4
2         010       010                  2
3         011       110                  6
4         100       001                  1
5         101       101                  5
6         110       011                  3
7         111       111                  7
4 Implementing An FFT Algorithm on the C6713 DSK
In this lab, we will create C code to implement an N-point FFT algorithm, where N is a power
of 2. To begin, let's create the C function FFT_func(COMPLEX *X, COMPLEX *W) that will
take the N input samples, namely $\{x[n]\}_N$, stored in the complex array X[], and use the N/2
twiddle factors in W[] to compute the N Fourier series coefficients, namely X[m], in-place. This
algorithm will over-write the same N registers at each stage of the algorithm. This process will
be broken up into two sections:
• Arrange the N inputs in bit-reversed order.
• Compute the N/2 butterflies at each of the $\log_2 N$ stages.
4.1 The Bit-Reversing Algorithm*
By comparing the binary values in Table 1, one can see that the bit-reversed order may be
calculated by adding a binary one to the most significant bit and having the carryover bits
propagate to the right. To begin this process, start with binary zero and add a binary one to
the most significant bit (i.e. add N/2 in binary to zero). To calculate the next bit reversed
index, test the most significant bit to see if it is a binary zero or binary one. If the bit is binary
zero, add a binary one to it and move on to the next index. If this bit is a binary one, set it to
binary zero and try to add a binary one to the next least significant bit (carryover to the right).
If this bit is a binary zero, add a binary one to it and move to the next index; otherwise, clear
it and move to the next bit and repeat the process. This continues until a binary zero is found
or the last bit-reversed index has been calculated. Due to the way this is implemented, the last
index will be calculated before an overflow to the right occurs.
To visualize this, we will work through a bit reversed sequence for M = 3 (N = 8 as in Table
1). First, we observe that 000(bin) maps to 000(bin) and that 111(bin) maps to 111(bin), so
these indices are not calculated. The first bit-reversed index is N/2, which in our case is equal
to 4. For intuition into the bit-reversing algorithm, we will represent the decimal number 4 as
the binary number 100(bin).
To calculate the next bit-reversed index, we try to add 100(bin) to our previous bit-reversed
index, 100(bin). Since the most significant bit in the previous bit-reversed index is a binary one,
we clear it by subtracting 100(bin) from the number to get 100(bin) - 100(bin) = 000(bin) to
clear the most significant bit. Then, we try to add 010(bin) to 000(bin), which is the result of
the previous subtraction (clearing the most significant bit). Since the next least significant bit
is a binary zero, our new bit-reversed index becomes 000(bin) + 010(bin) = 010(bin).
To calculate the next bit-reversed index, we again try to add 100(bin) to the previous
bit-reversed index, 010(bin). Since the most significant bit in 010(bin) is a binary zero, our new
bit-reversed index becomes 010(bin) + 100(bin) = 110(bin).
To calculate the next bit-reversed index, we again try to add 100(bin) to our previous bit-reversed
index, 110(bin). Since the most significant bit is a binary one, we clear it by subtracting 100(bin)
to get 110(bin) - 100(bin) = 010(bin). Then, we try to add 010(bin) to 010(bin). Since the
next least significant bit in 010(bin) is a binary one, we clear it by subtracting 010(bin) to get
010(bin) - 010(bin) = 000(bin). Then, we try to add 001(bin) to 000(bin). Since the next least
significant bit is a binary zero, our new bit-reversed index becomes 000(bin) + 001(bin) = 001(bin).
This process continues until all N bit-reversed indices have been calculated. In practice, we take
advantage of the fact that the first and last indices map to themselves, so we only calculate the
N − 2 indices in between.
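The carry-to-the-right counting just described can be tried in isolation before it is combined with the data swaps. The sketch below only generates the sequence of bit-reversed indices; the function name bit_reversed_order and the output array rev are hypothetical and introduced for illustration.

```c
#include <assert.h>

/* Fill rev[i] with the bit-reversed value of i, for i = 0 .. n_pts-1,
   by repeatedly "adding one" at the most significant bit and letting
   the carry propagate to the right. n_pts must be a power of two. */
void bit_reversed_order(short *rev, short n_pts)
{
    short i, j = 0, k;

    assert(n_pts >= 2 && (n_pts & (n_pts - 1)) == 0);

    rev[0] = 0;                      /* index 0 maps to itself */
    for (i = 1; i < n_pts; i++) {
        k = n_pts >> 1;              /* 1 in the msb, 0 elsewhere */
        while (k <= j) {             /* bit tested is 1: clear it ... */
            j = j - k;
            k = k >> 1;              /* ... and carry one bit to the right */
        }
        j = j + k;                   /* bit tested is 0: set it */
        rev[i] = j;
    }
}
```

For n_pts = 8 the array is filled with 0, 4, 2, 6, 1, 5, 3, 7, matching the last column of Table 1.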
Examine the code for the bit-reversing algorithm and associated data swaps listed in Figure
6. Note that this is not a complete program but just a fragment of code. Observe that the
bit-reversal mapping is a one-to-one mapping, so we can swap the value at the current index
with the value at the bit-reversed index. This can be done in-place using a temporary storage
variable, which is illustrated in lines 33 through 38 in Figure 6. Also, the swapping of values
between the original and bit-reversed indices only needs to occur once. To account for this, we
compare the current index with the bit-reversed index and only swap if the current index is less
than the bit-reversed index. Then, when the current index is greater than the bit-reversed index,
we know that the values have already been swapped. This is coded in line 31 of Figure 6, where i
is the current index and j is the bit-reversed index of i.
Assignment
1. Work through the description of the 8-point bit-reversing algorithm given above
using decimal numbers instead of binary numbers. Then, briefly explain
how the C code in lines 17-27 of Figure 6 implements the bit-reversing algorithm. NB:
If k = 01100(bin) (which has decimal equivalent k = 12(dec)), then the C command k>>1
returns the binary number 00110(bin) (which has decimal equivalent 6(dec)), meaning it
shifts all of the bits one to the right (and discards the least significant bit). This is the
equivalent of dividing the decimal number k = 12(dec) by 2 (i.e. 12/2 = 6).
1.) // Bit reversing algorithm coded in C
2.) // The input array X (of type COMPLEX)
3.) // N, M, N2, and COMPLEX defined in FFT_header.h
4.)
5.) COMPLEX temp; // temporary storage complex variable swaps
6.) short i; // current sample index
7.) short j; // bit reversed index
8.) short k; // used to propagate carryovers
9.)
10.) // N2 = N/2 = N >> 1 is defined in FFT_header.h
11.)
12.) // Bit-reversing algorithm. Since 0 -> 0 and N-1 -> N-1
13.) // under bit-reversal, these two reversals are skipped.
14.)
15.) j = 0;
16.)
17.) for(i=1; i<(N-1); i++)
18.) {
19.) k = N2; // k is 1 in msb, 0 elsewhere
20.)
21.) while(k<=j) // Propagate carry to the right if bit is 1
22.) {
23.) j = j - k; // Bit tested is 1, so clear it.
24.) k = k>>1; // Carryover binary 1 to right one bit.
25.) }
26.)
27.) j = j+k; // current bit tested is 0, add 1 to that bit
28.)
29.) // Swap samples if current index is less than bit reversed index.
30.)
31.) if(i<j)
32.) {
33.) temp.real = X[j].real; // Hold value at bit reversed location.
34.) temp.imag = X[j].imag;
35.) X[j].real = X[i].real; // Insert value in current location
36.) X[j].imag = X[i].imag; // at bit reversed location.
37.) X[i].real = temp.real; // Insert value in bit reversed
38.) X[i].imag = temp.imag; // location at current location
39.) }
40.) } // end for loop
Figure 6: C code fragment for the bit-reversing algorithm and associated data swaps.
4.2 Implementing the Factored Butterfly*
Once the input samples have been put into bit-reversed order, we can implement the in-place
decimation-in-time signal flow graph shown in Figure 5. This signal flow graph is implemented
by grouping the butterfly structures at each stage according to their leading twiddle factors.
First, all of the butterfly structures that have $W_N^{0}$ as their leading factor are calculated.
Then, the butterfly structures with $W_N^{-r}$ as their leading factor are calculated. Then, the
butterfly structures with $W_N^{-2r}$ as their leading factor are calculated, and so on, until all
$N/2$ butterfly structures at a given stage, with leading factors $W_N^{-qr}$, have been calculated.
Here, $qr < N/2$, and $q$ and $r$ are non-negative integers. In the first stage, $r = N/2$ and $q$
may take on the value $\{q = 0\}$, which means that the $N/2$ butterfly structures with $W_N^{0}$ as
their leading factor are calculated. In the second stage, $r = N/4$ and $q$ may take on the values
$\{q = 0, 1\}$, which means that the $N/4$ butterfly structures with $W_N^{0}$ as their leading factor
are calculated and then the $N/4$ butterfly structures with $W_N^{-N/4}$ as their leading factor are
calculated. In the $l$-th stage, $r = N/2^{l}$ and $q$ may take on the values
$\{q = 0, 1, \ldots, 2^{l-1} - 1\}$, and the $N/2$ butterfly structures are calculated accordingly.
This continues until all of the $N/2$ butterfly structures in each of the $M = \log_2 N$ stages have
been calculated.
The C coded implementation of the decimation-in-time algorithm is shown in Figure 7 (not a
complete program; notice the absence of main()). Note that the array X[] is assumed to be in
bit-reversed order (i.e. the input array has been processed by the C code in Figure 6). Pay
particular attention to lines 26 and 27 in Figure 7. The twiddle factors, namely $W[k]$,
$k = 0, 1, \ldots, (N/2) - 1$, are assumed to be the first $N/2$ roots of unity defined by
$W[k] = e^{j2\pi k/N}$, but we need the complex conjugates of these, namely
$W^{*}[k] = e^{-j2\pi k/N}$. This means the values stored in the structure temp are
\[ \mathrm{Re}\{temp\} = \mathrm{Re}\{X[i\_lower]\}\,\mathrm{Re}\{W^{*}[k]\} - \mathrm{Im}\{X[i\_lower]\}\,\mathrm{Im}\{W^{*}[k]\}
   = \mathrm{Re}\{X[i\_lower]\}\,\mathrm{Re}\{W[k]\} + \mathrm{Im}\{X[i\_lower]\}\,\mathrm{Im}\{W[k]\} \]
\[ \mathrm{Im}\{temp\} = \mathrm{Im}\{X[i\_lower]\}\,\mathrm{Re}\{W^{*}[k]\} + \mathrm{Re}\{X[i\_lower]\}\,\mathrm{Im}\{W^{*}[k]\}
   = \mathrm{Im}\{X[i\_lower]\}\,\mathrm{Re}\{W[k]\} - \mathrm{Re}\{X[i\_lower]\}\,\mathrm{Im}\{W[k]\} \]
where i_lower is the index of the lower butterfly node for a given stage and twiddle factor, and
k is the twiddle factor index used in Figure 7.
The variable temp is used to store the value of the lower node scaled by the conjugate of the
twiddle factor. This is required so the in-place algorithm can calculate and overwrite the values
in the lower node at the current stage in lines 29 and 30 and still have the scaled value of the
lower node available to calculate and overwrite the top node in lines 32 and 33. This is a
programming concern that comes up often when two or more registers are used for accumulating
or in-place processing. This problem is easily solved by using temporary registers.
For sake of clarity, note that the variables step and stage code the variables r and l from the
previous discussion, respectively. Also, at each stage, q takes on the values {0, 1, ..., numBF-1}.
For the final C coded decimation-in-time FFT algorithm, the code in Figure 7 is integrated
with the code from Figure 6, along with the header file FFT_header.h, to create the C callable
function FFT_func.c that we will use to implement our FFT algorithm. In the case that N = 8,
the header file FFT_header.h is listed in Figure 8. This header file defines the structure COMPLEX
(see Lab 2) and the order of the FFT to be implemented. This header file may be created using
the homebrew MATLAB function FFT_header_gen.m, which is available on the class webpage.
Download this file and use it to re-create the file listed in Figure 8.
The C program that we will use to test our FFT function is shown in Figure 9, which also
includes FFT_header.h. The C code in Figures 6, 7, 8, and 9 has been grouped
together into the project FFT_test.pjt that is available on the class webpage. You are expected
to create a header file like the one listed in Figure 8 using the MATLAB function provided.
Download this project file and accompanying files, FFT_test.c and FFT_func.c, and create the
header file in MATLAB. Open the project in CCS, and implement it on the DSK to verify that
the FFT algorithm is working correctly.
NB: the program FFT_test.c does not terminate in an infinite loop as do programs which use
the codec; it runs to completion and then stops at an exit location. In order to re-run it,
you can either reload the program or click on Debug → Restart followed by Debug →
Run.
1.) // N, M, COMPLEX defined in FFT_header.h, N2=N/2 (pre-defined)
2.) // COMPLEX W[N2] // twiddle factors, passed to FFT_func.c
3.) COMPLEX temp; // temporary storage complex variable
4.) short i,j,k; // loop indices
5.) short i_lower; // Index of lower point in butterfly
6.) short step;
7.) short stage; // FFT stage
8.) short DFTpts; // # of points in sub DFT and offset to next DFT
9.) short numBF; // # of butterflies in one DFT, offset to lower node
10.)
11.) // Assume X[] is in reverse-bit order and do the M=log2(N) stages of butterflies
12.) step = N2; // step = N/2, N/4, N/8, ... 1
13.) for(stage=1; stage <= M; stage++)
14.) {
15.) DFTpts = 1 << stage; // DFTpts = 2^stage = points in sub DFT
16.) numBF = DFTpts/2; // number of butterflies in sub-DFT
17.) k = 0; // initial twiddle factor index
18.)
19.) // Do butterflies for current stage
20.) for(j=0; j<numBF; j++) // do the numBF butterflies per sub DFT
21.) {
22.) // Compute butterflies that use same twiddle factor, W[k]
23.) for(i=j; i<N; i += DFTpts)
24.) {
25.) i_lower = i + numBF; // index of lower point in butterfly
26.) temp.real = X[i_lower].real*W[k].real + X[i_lower].imag*W[k].imag;
27.) temp.imag = X[i_lower].imag*W[k].real - X[i_lower].real*W[k].imag;
28.)
29.) X[i_lower].real = X[i].real - temp.real;
30.) X[i_lower].imag = X[i].imag - temp.imag;
31.)
32.) X[i].real = X[i].real + temp.real;
33.) X[i].imag = X[i].imag + temp.imag;
34.) }
35.) k += step; // increment twiddle index
36.) }
37.) step = step/2; // calculate step for next stage
38.) }
Figure 7: C code fragment for the in-place decimation-in-time butterfly stages.
Assignment
2. Using the DSK and the project FFT_test.pjt, located on the class webpage, calculate
the FS coefficients for $\{x[n] = \cos(\frac{3\pi}{8} n),\ n = 0, 1, \ldots, 15\}$. What are the
values of N, M, and N2? Calculate the 16 discrete Fourier series coefficients by hand using eqn (1)
to verify that the C coded FFT function is working properly. If the sample rate of
the system were $f_o = 8$ kHz, then what frequencies would the non-zero FS coefficients
correspond to? If the sequence x[n] were periodically extended (aliased) every 16 samples
(i.e. x[n] = x[n+16r] for every integer r) and sent to the on-board codec, what would the output be?
3. Create a function that implements an in-place inverse FFT. Label this function
IFFT_func.c. This function should have two complex arrays passed to it, namely
X[] and W[], where X[] will contain the array of elements that the IFFT algorithm
will operate on in-place, and W[] will contain the N/2 twiddle factors. These twiddle
factors should be the same as those used in FFT_func.c. Explain what changes need
to be made to convert an FFT algorithm to an IFFT algorithm. (HINT: Compare
equations (1) and (2).) Use the output of the FFT evaluated in question 2 as the
input to your IFFT function. What is the output? (HINT: If your IFFT algorithm
is working correctly, you will find that the cascade of the FFT and the IFFT is the
identity operator. Thus the output of the cascade is the input.)
// FFT_header.h
// This file must be included in FFT_func.c
// and by the program that calls FFT_func.c
#define N 8 // N-point FFT
#define M 3 // M=log2(N)
#define N2 4 // N/2 (number of twiddle factors)
#define PI 3.14159265358979 // fixed-point approx. to pi
// COMPLEX structure (see Lab 2); the float member type is assumed here
typedef struct { float real; float imag; } COMPLEX;
Figure 8: The header file FFT_header.h for the case N = 8.
// FFT_test.c
// Used to test FFT_func.c
// N-point FFT, where N is defined in FFT_header.h
#include <math.h>
#include <stdio.h>
#include "FFT_header.h" // defines COMPLEX structure
                        // and FFT order
void FFT_func(COMPLEX *X, COMPLEX *W); // FFT function prototype

COMPLEX X[N];  // Declare input array
COMPLEX W[N2]; // Used to hold the N/2 twiddle factors

int main()
{
    short i; // loop index

    // Calculate twiddle factors
    for(i=0; i<N2; i++)
    {
        W[i].real = cos(2.0*PI*i/N);
        W[i].imag = sin(2.0*PI*i/N);
    }

    // Initialize input array
    for(i=0; i<N; i++)
    {
        X[i].real = cos((float)2.0*PI*3*i/N);
        X[i].imag = 0.0;
    }

    FFT_func(X,W); // perform in-place FFT

    // Display results on screen
    for(i=0; i<N; i++)
        printf("X[%d] = \t%10.5f + j%3.5f\n", i, (X[i]).real, (X[i]).imag);
    return 0;
}
Figure 9: A C coded program that tests our FFT algorithm. Adapted from [8].
5 FIR Filtering using an FFT Algorithm
Now that we have working FFT and IFFT algorithms, we need to incorporate them into a
real-time system. In this section, we will process blocks or vectors of data sequentially in time
instead of calculating scalar outputs sequentially in time. To begin, let’s build a template for
processing blocks of data on the C6713 DSK. We are going to implement the equivalent of the
straight wire program from Lab 2, but we will use it to test our FFT and IFFT algorithms.
Figure 10: Straight Wire using Coded FFT and IFFT Algorithms.
Consider the system in Figure 10. The output sequence satisfies $\hat{x}[n] \approx x[n]$, with
discrepancies resulting from finite precision processing. However, these effects will be minimal
due to the floating-point algorithm used. The trade-off is that the floating-point algorithm will
take more clock cycles to compute than a fixed-point algorithm would, but the finite precision
effects will not need to be managed. A C code implementation of Figure 10 is given in Figure 11.
Note that the functions FFT_func.c, IFFT_func.c, and appropriate header files must be included
with the project.
The idea in IO_stream.c is that samples are read from the on-board codec, converted to floating-
point numbers, and stored in the floating-point array io_buffer (input-output buffer). Once
this buffer is full, the samples are copied to a complex array structure labelled process, and
previously processed data is copied to the array io_buffer. While this is taking place, the
imaginary part of the process array is set to zero. This is done since the samples being read in
are assumed to be purely real. This may seem redundant since the output in the process array
is assumed to be real, but due to finite precision effects, the imaginary part of the output may
not be exactly equal to zero. Next, the FFT algorithm is applied to the data in the process
array in the main() function, while an interrupt is used to output previously processed samples
and read in new samples. Now, as a sample is output from a memory location in the array
io_buffer, a new sample is read into the same memory location and a global counter (ctr in
Figure 11) is used to point to the next memory location in io_buffer. This continues until
io_buffer is full, at which point the procedure is repeated.
The DSP algorithm waits for the io_buffer to fill in line 30. Once the buffer is full, the variable
flag is set to true in line 16 (located in the interrupt) and the algorithm, upon return from
the interrupt, proceeds to line 33, where it resets the flag variable to false. Then, in lines
35 through 41, the data in io_buffer is transferred to the real channel of the process array,
the imaginary part of the process array is zeroed, and previously processed data is moved to
the io_buffer. These data transfers will take place between the time the last sample of the
previous block was read in from the codec and the time the first sample of the next block is sent
to the codec. Therefore, the DSP chip will have roughly $N t_o$ seconds to process the rest of the
algorithm, where N is the length of the block being processed and $t_o$ is the sample period of the
system.
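The buffer hand-off described above can be simulated on a host computer to see its timing. In the sketch below the interrupt is replaced by an ordinary function call, the FFT/IFFT cascade is replaced by the identity, and the block length BLK and the names isr_step, swap_if_full, run_straight_wire, and process_re are all hypothetical stand-ins introduced for illustration.

```c
#include <assert.h>

#define BLK 4                       /* illustrative block length (N in the text) */

static float io_buffer[BLK];        /* shared input-output buffer */
static float process_re[BLK];       /* real channel of the process array */
static short ctr = 0;
static short flag = 0;

/* Stand-in for the interrupt service routine: output the previously
   processed sample, then store the new input in the same location. */
float isr_step(float in_sample)
{
    float out_sample = io_buffer[ctr];
    io_buffer[ctr++] = in_sample;
    if (ctr >= BLK) {
        ctr = 0;
        flag = 1;                   /* a full block is ready */
    }
    return out_sample;
}

/* Stand-in for the body of the main loop: swap new and processed data. */
void swap_if_full(void)
{
    short i;
    float tmp;

    if (!flag)
        return;
    flag = 0;
    for (i = 0; i < BLK; i++) {
        tmp = process_re[i];
        process_re[i] = io_buffer[i];
        io_buffer[i] = tmp;
    }
    /* The FFT followed by the IFFT would run here; in this sketch the
       processing is the identity, so process_re is left unchanged. */
}

/* Feed n_samples inputs through the simulated system. */
void run_straight_wire(const float *in, float *out, int n_samples)
{
    int t;

    assert(n_samples >= 0);
    for (t = 0; t < n_samples; t++) {
        out[t] = isr_step(in[t]);
        swap_if_full();
    }
}
```

In this sketch the output reproduces the input delayed by two blocks (2·BLK samples): one block to fill io_buffer and one block while the data sits in the process array.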
/*1 */ // IO_stream.c - include header files, function prototypes, and W[N2]
/*2 */ COMPLEX process[N];
/*3 */ float tmp;
/*4 */ float io_buffer[N];
/*5 */ short ctr=0;
/*6 */ short flag=0;
/*7 */
/*8 */ interrupt void c_int11() // interrupt service routine
/*9 */ {
/*10*/     output_sample((short)io_buffer[ctr]);
/*11*/     io_buffer[ctr++]=(float)input_sample();
/*12*/
/*13*/     if(ctr >= N)
/*14*/     {
/*15*/         ctr=0;
/*16*/         flag = 1;
/*17*/     }
/*18*/     return; // return from interrupt
/*19*/ }
/*20*/
/*21*/ void main()
/*22*/ {
/*23*/     short i; // local counter
/*24*/
/*25*/     // Calculate twiddle factors for (I)FFT
/*26*/
/*27*/     comm_intr();
/*28*/     while(1)
/*29*/     {
/*30*/         while(flag==0);
/*31*/
/*32*/         // Once the buffer is full, implement DSP algorithm here
/*33*/         flag = 0;
/*34*/
/*35*/         for(i=0; i<N; i++)
/*36*/         {
/*37*/             tmp = process[i].real;
/*38*/             process[i].real = io_buffer[i];
/*39*/             process[i].imag = 0.0;
/*40*/             io_buffer[i] = tmp;
/*41*/         }
/*42*/
/*43*/         FFT_func(process, W); // perform in-place FFT
/*44*/         IFFT_func(process, W); // perform in-place IFFT
/*45*/     }
/*46*/ }
Figure 11: IO_stream.c, a C code implementation of the system in Figure 10.
Assignment
4. Download the project IO_stream.pjt, IO_stream.c, and FFT_header.h from the class
webpage. Incorporate your FFT and IFFT functions into this project, then build and
run the project on the DSK. Verify that the program implements a straight wire.
Notice that this FFT header file is designed for a 128-point FFT, which means that
data is processed in blocks of 128 samples. Once the buffer has filled and the data
has been swapped between the io_buffer array and the process array, the algorithm
has $N t_0 = 128 \times 0.125$ ms $= 16$ ms to execute the rest of the algorithm. In
reality, this algorithm will finish after only a sample or two has been read in from the
codec (about 0.25 ms), so the DSK will be idle (ignoring the input and output of data
via the codec) for about 15.75 ms for every block of data that is processed. This extra
time will allow us to use the FFT (and IFFT) to implement block convolution, which
we will discuss next. This project will serve as a template for doing block processing in
real time.
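The buffer swap in IO_stream.c can be tried on a host PC before moving to the DSK. The following is a minimal sketch, not the course code: isr_tick(), swap_block(), and run_stream() are hypothetical stand-ins for c_int11() and the body of the main loop, the layout of the COMPLEX type is assumed, the block length is shortened to 8, and the FFT/IFFT pair is left out because together they act as an identity.

```c
#include <assert.h>
#include <stddef.h>

#define N 8   /* block length; the lab uses 128, shortened here for illustration */

typedef struct { float real, imag; } COMPLEX;   /* assumed layout of the complex type */

float   io_buffer[N];   /* shared with the simulated ISR */
COMPLEX process[N];     /* block currently being processed */
int     ctr  = 0;       /* index into io_buffer */
int     flag = 0;       /* set when io_buffer is full */

/* Simulated ISR: emit the old buffer entry, then overwrite it with the new input. */
float isr_tick(float in)
{
    float out = io_buffer[ctr];
    io_buffer[ctr++] = in;
    if (ctr >= N) {
        ctr = 0;
        flag = 1;
    }
    return out;
}

/* Main-loop body: swap io_buffer with process (FFT/IFFT omitted, they cancel). */
void swap_block(void)
{
    int i;
    flag = 0;
    for (i = 0; i < N; i++) {
        float tmp = process[i].real;
        process[i].real = io_buffer[i];
        process[i].imag = 0.0f;
        io_buffer[i] = tmp;
    }
}

/* Feed len input samples through the simulated system, one sample per "interrupt". */
void run_stream(const float *x, float *y, size_t len)
{
    size_t n;
    assert(x != NULL && y != NULL);
    for (n = 0; n < len; n++) {
        y[n] = isr_tick(x[n]);
        if (flag)
            swap_block();
    }
}
```

Feeding a ramp through run_stream() should return the same ramp, delayed by a whole number of blocks, which is the straight-wire behavior the assignment asks you to verify.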
5.2 Linear Convolution using the DFT
A property that holds for all Fourier transforms is that multiplication in one domain transforms
to convolution in the other. Since the DFT is periodic in both domains, the convolution is
cyclic in both domains. In this section, we wish to compute the linear convolution of two
finite-length sequences in the time domain by first calculating their respective DFTs, multiplying
the DFTs term-by-term (see footnote 3), and then taking the inverse DFT of the product to get
the convolved time sequence. However, if we apply this process to the finite-length sequences
only, the result is the circular convolution and not the linear convolution of the two sequences.
To get linear convolution from the DFT, we note that the convolution in time of two finite-length
sequences, say $x_1[n]$ of length $L$ and $x_2[n]$ of length $Q$, results in $y[n] = (x_1 * x_2)[n]$,
which is of length $L + Q - 1$. Using this fact, we can zero-pad the time sequences so that
they are both of length $L + Q - 1$. To do this, we define $\tilde{x}_1[n] = x_1[n]$ for
$n = 0, 1, \ldots, L - 1$ and $0$ for $n = L, L + 1, \ldots, L + Q - 2$, and
$\tilde{x}_2[n] = x_2[n]$ for $n = 0, 1, \ldots, Q - 1$ and $0$ for $n = Q, Q + 1, \ldots, L + Q - 2$.
Then we can take the DFTs of $\tilde{x}_1$ and $\tilde{x}_2$, multiply the DFTs
term-by-term, and inverse DFT the product to get the desired linear convolution $(x_1 * x_2)[n]$.
Footnote 3. It is assumed a priori that the two sequences are of the same length.
5.3 Overlap and Add Block Convolution FIR Filtering*
In the previous programming example, IO_stream.c, the output block size was the same as the
input block size, so there was no overlap between the blocks. However, when block convolution is
implemented, blocks of $L$ samples are read in from the codec and output blocks of $L + Q - 1$
samples are produced, which overlap by $Q - 1$ samples. This leaves two options for
dealing with the overlap. Either read in blocks of $L$ samples, overlap the input blocks (i.e.
make the last $Q - 1$ samples of a given input block the first $Q - 1$ samples of the next input
block), convolve the input blocks with the filter response using an FFT method, and save the
last $L$ samples; or read in blocks of $L$ samples, convolve the input blocks with the filter response
using an FFT method, and add the last $Q - 1$ samples (term-by-term) from the processed data
of a given block to the first $Q - 1$ samples of processed data in the next block. These two
algorithms are known as the Overlap and Save and Overlap and Add methods, respectively [6].
In this section, we will develop the overlap and add algorithm, which you will ultimately code
in C. To visualize this algorithm, consider the sequence $x[n]$ in the top panel of Figure 12.
This input stream, $x[n]$, is split into blocks of length $L = 6$ (shown in the bottom three panels of
Figure 12). Let the FIR filter $h[n]$ be the 1 kHz notch filter from Lab 4 with duration 3 samples
(i.e. $Q = 3$). Now, if each of the sequences $x_1[n]$, $x_2[n]$, and $x_3[n]$ is convolved with $h[n]$, then
each result will be of length $N = L + Q - 1 = 8$; these are shown in the top three panels of Figure 13. The
output is then formed by summing these sequences vertically (term-by-term) to generate the
output $y[n]$.
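The block-by-block summation described above can be sketched with direct convolution, before any FFT is involved. This is an illustrative host-side sketch: convolve() and overlap_add() are hypothetical helper names, and the block sizes match the L = 6, Q = 3 example.

```c
#include <assert.h>
#include <string.h>

#define L 6   /* input block length, as in the example above */
#define Q 3   /* filter duration */

/* Direct linear convolution; y must hold lx + lh - 1 samples. */
void convolve(const float *x, int lx, const float *h, int lh, float *y)
{
    int n, k;
    for (n = 0; n < lx + lh - 1; n++) {
        y[n] = 0.0f;
        for (k = 0; k < lh; k++)
            if (n - k >= 0 && n - k < lx)
                y[n] += h[k] * x[n - k];
    }
}

/* Overlap and add over nblocks blocks of L samples; y must hold nblocks*L + Q - 1 samples. */
void overlap_add(const float *x, int nblocks, const float *h, float *y)
{
    float yb[L + Q - 1];   /* convolution of one block with h */
    int b, n;
    assert(nblocks > 0);
    memset(y, 0, (size_t)(nblocks * L + Q - 1) * sizeof(float));
    for (b = 0; b < nblocks; b++) {
        convolve(x + b * L, L, h, Q, yb);
        /* Each block result starts L samples after the previous one, so its
           first Q - 1 samples land on the previous block's last Q - 1 samples. */
        for (n = 0; n < L + Q - 1; n++)
            y[b * L + n] += yb[n];
    }
}
```

Comparing the output of overlap_add() against convolve() applied to the whole input is a quick way to confirm that the two agree to within floating-point rounding.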
Assignment
5. Consider the output $y[n]$ in Figure 13 when $n = 6, 7, 8,$ and $9$. Show that the summation
of the overlap from the blocks $(h * x_1)[n]$ and $(h * x_2)[n]$ gives the desired result
$y[n] = h[0]x[n] + h[1]x[n - 1] + h[2]x[n - 2]$. (HINT: Use the linear shift-invariance of
convolution.)
[Figure 13: panels (h * x1)[n], (h * x2)[n], (h * x3)[n]]
5.4 Implementing the Overlap and Add Algorithm using an FFT*
Located on the class webpage is a homebrew MATLAB function, Overlap_Add_header_gen.m,
that creates the header files you will need to implement an overlap and add algorithm. This
function creates the header files FFT_header.h and coeffs.h, where FFT_header.h is the
standard FFT header file that we have been using. The new file, coeffs.h, must be included in
the C source code containing the main() function. This file contains the filter coefficients,
stored in the floating-point array h[], defines the number of inputs to read in, L, and defines
the number of samples to overlap, P. In the previous discussion, $L$ was the number of samples
to read in, $Q$ was the filter duration, and $Q - 1$ was the number of samples to overlap. Now,
$L$ remains the same, the number of samples to overlap is $P = Q - 1$, and the FFT order is
$N = L + P$ (see footnote 4). Note that the function Overlap_Add_header_gen.m can generate this
information from the filter coefficients h and the FFT order N, which are the two variables that
must be passed to the function. Download this function from the class webpage and use the
MATLAB commands help and type to see how it works.
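To make the relationship between N, L, and P concrete, the fragment below sketches what the generated definitions might look like. It is hypothetical: the actual layout of coeffs.h is determined by the MATLAB function, N is assumed to come from FFT_header.h, check_sizes() is an invented helper, and the three coefficients are illustrative values only (a 3-tap filter with zeros at $\pm\pi/4$ rad/sample), not necessarily the Lab 4 filter.

```c
#include <assert.h>

/* Hypothetical contents only; the real files are produced by the MATLAB function. */
#define N 128        /* FFT order (assumed to be defined in FFT_header.h) */
#define P 2          /* samples to overlap: filter duration Q = 3, so P = Q - 1 */
#define L (N - P)    /* input samples per block, chosen so that N = L + P */

/* Illustrative 3-tap coefficients; not taken from the course files. */
float h[P + 1] = { 1.0f, -1.41421356f, 1.0f };

/* Sanity check that the block sizes and the coefficient count are consistent. */
void check_sizes(void)
{
    assert(L + P == N);
    assert((int)(sizeof h / sizeof h[0]) == P + 1);
}
```

With N = 128 and a 3-tap filter, each block reads in L = 126 new samples and carries P = 2 samples of overlap into the next block.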
For the overlap and add algorithm, you will need the following global variables:
• a complex structure of length N to hold the DFT of the filter response,
• a complex structure of length N to use as a process buffer,
• a floating-point array (IO buffer) of length L to interface the codec,
• a floating-point array of length L to hold the samples to overlap from the previous block
of processed data; this array will also be used to store the final processed block of data
that will be copied to the IO buffer,
• temporary variable(s) for the complex multiplication of the filter FS coefficients with the
DFT of the zero-padded input sequence, and
• other miscellaneous variables needed to implement any real-time block processing
algorithm.
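The temporary variables in the list above matter because the product is written back into the process buffer. A minimal sketch, assuming a COMPLEX structure with real and imag members and a hypothetical function name multiply_spectra():

```c
#include <assert.h>

typedef struct { float real, imag; } COMPLEX;   /* assumed complex type */

/* In-place product x[k] <- x[k] * H[k] for k = 0, ..., n - 1. */
void multiply_spectra(COMPLEX *x, const COMPLEX *H, int n)
{
    int k;
    assert(n > 0);
    for (k = 0; k < n; k++) {
        /* Save both parts first: x[k].real is overwritten below but is
           still needed to compute the new imaginary part. */
        float xr = x[k].real;
        float xi = x[k].imag;
        x[k].real = xr * H[k].real - xi * H[k].imag;
        x[k].imag = xr * H[k].imag + xi * H[k].real;
    }
}
```

As a quick check, (1 + 2j)(3 + 4j) = -5 + 10j. Without the temporaries, the imaginary part would be computed from the already-updated real part and come out wrong.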
Once you have your variables in order, you must initialize some of the variables. Before the
codec is initialized, initialize the following:
• the twiddle factors, and
• the discrete FS coefficients of the zero-padded filter impulse response.
Footnote 4. Note that P is the filter order as it was defined in Lab 4, except that the letter
denoting it has been changed from N to P. This is done to avoid a conflict of variables with the
FFT order N.
After these initializations are made, initialize the codec, and code the overlap and add algorithm
as follows (in the order given):
• copy the IO buffer into the first L memory locations in the process array and copy the
previously processed data into the IO buffer,
• copy the last P samples from the process array into the overlap array and zero the last
L − P samples of the overlap array,
• zero the last P real samples of the process array (the zero-padding), zero the imaginary
part of the process array, and perform an in-place FFT on the array,
• multiply the process array with the DFT of the zero-padded filter coefficients,
• IFFT the process array,
• add the first L samples of the process array to the overlap array, which will serve as the
output for the next block of data.
Assignment
6. Create a project that implements the overlap and add algorithm given above. Include
a copy of the C source code that you used to achieve this. Briefly explain how your
program works. What is the filter latency? Numerically, compare the computations
required to implement this algorithm with the direct convolution of Lab 4.
6 End Notes
In this lab, we have explored the FFT algorithm and one of its applications to real-time systems.
Other applications include spectrum analysis and orthogonal frequency division multiplexing
(OFDM) for modulation and demodulation, which we will explore in more depth in later labs.
References
[1] Rulph Chassaing. DSP Applications Using C and the TMS320C6x DSK. Wiley, New York,
2002.
[2] Colorado State University, Fort Collins, CO. Signals and Systems Laboratory 11: The FFT
and its Applications, 2001.
[3] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier
series. Mathematics of Computation, 19:297–301, 1965.
[4] Simon Haykin and Barry Van Veen. Signals and Systems. Wiley, New York, 1999.
[5] M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier
transform. IEEE ASSP Magazine, 1(4):14–21, October 1984.
[6] Alan V. Oppenheim and Ronald W. Schafer. Discrete-Time Signal Processing. Prentice
Hall, Upper Saddle River, NJ, 1989.
[7] John G. Proakis and Dimitris G. Manolakis. Digital Signal Processing: Principles,
Algorithms, and Applications. Prentice Hall, Upper Saddle River, NJ, 1996.
[8] Steven A. Tretter. Communication Design Using DSP Algorithms: With Laboratory Exper-
iments for the TMS320C6701 and TMS320C6711. Kluwer Academic/Plenum Publishers,
New York, 2003.