Performance Analysis and Design of a Discrete Cosine Transform Processor Using CORDIC Algorithm
Master of Technology
in
VLSI Design and Embedded System
By
Satyasen Panda
ROLL No: 208EC208
Under the Guidance of
Prof. K.K.Mahapatra
CERTIFICATE
This is to certify that the thesis entitled, “Performance Analysis and Design of a Discrete
Cosine Transform processor using CORDIC algorithm” submitted by Satyasen Panda in
partial fulfillment of the requirements for the award of Master of Technology Degree in
Electronics & Communication Engineering with specialization in “VLSI Design and
Embedded System” at the National Institute of Technology, Rourkela is an authentic work
carried out by him under my supervision. To the best of my knowledge, the matter embodied in
the thesis has not been submitted by him to any other University / Institute for the award of any
Degree or Diploma.
I would like to thank all faculty members and staff of the Department of Electronics
and Communication Engineering, N.I.T. Rourkela for their generous help in various ways for the
completion of this thesis.
I would like to thank my friends. I am also thankful to my classmates for all the
thoughtful and mind stimulating discussions we had, which prompted us to think beyond the
obvious.
Satyasen Panda
Contents
Abstract
List of figures
List of tables
1. Introduction
1.1 Motivation
4.2.1 Decorrelation
4.2.3 Separability
4.2.4 Symmetry
4.2.5 Orthogonality
5.1.1 Chen's algorithm
6.1.1 Pre-processor
6.1.2 Post-processor
8.1 Conclusion
ABSTRACT
CORDIC is an acronym for COordinate Rotation DIgital Computer. It is a class of shift-and-add
algorithms for rotating vectors in a plane, usually used for the calculation of
trigonometric functions, multiplication, division and conversion between binary and mixed-radix
number systems in DSP applications such as the Discrete Cosine Transform (DCT). Jack E.
Volder's CORDIC algorithm is derived from the general equations for vector rotation. The
CORDIC algorithm has become a widely used approach to elementary function evaluation when
silicon area is a primary constraint. The implementation of the CORDIC algorithm requires less
complex hardware than the conventional method.
In this thesis, the CORDIC algorithm has been implemented on a Xilinx Spartan 3E FPGA kit
using VHDL and is found to be accurate. The thesis also contains the implementation of the
Discrete Cosine Transform using the radix-2 decimation-in-time algorithm on the same FPGA kit.
Owing to the high speed, low cost and greater flexibility offered by FPGAs over DSP processors,
FPGA-based computing is becoming the heart of modern digital signal processing systems.
Moreover, the test bench generated by Xilinx ISE 9.2i verifies the results against DCT values
computed directly in MATLAB.
List of figures
4.1 Two-dimensional DCT basis functions for N = 8. Gray represents zero, white represents
positive amplitudes, and black represents negative amplitudes
4.2 (a) Normalized autocorrelation of uncorrelated image before and after DCT; (b) Normalized
autocorrelation of correlated image before and after DCT
6.1 Top-level schematic of CORDIC module
List of tables
Chapter 1
Introduction
1.1 MOTIVATION
2-D DCT algorithms are the most widely used algorithms for image compression, so the main
focus of this thesis is on efficient hardware implementations of 2-D DCT based image
compression: reducing the number of computations, increasing the accuracy of
reconstruction, and reducing the chip area by using the CORDIC algorithm. This in turn
reduces the power consumption of the compression technique. Because the number of
applications that require higher-dimensional DCT algorithms is growing, emphasis is placed
on algorithms such as CORDIC that are extensible to higher-dimensional cases.
Chapter 2
2.1 CORDIC fundamentals
The basic block diagram of the CORDIC processor is given below.
[Figure: block diagram of the CORDIC processor; the y output is $y' = x\sin\theta + y\cos\theta$]
During the angle conversion phase, the angle $\theta$ is represented as the sum of a
decreasing sequence of elementary angles $\{\alpha_{m,i}\}$, $\theta = \sum_{i=0}^{b-1} \delta_i\,\alpha_{m,i}$,
where the shift sequence $2^{-s_{m,i}}$ defines a radix-2 number system, $\delta_i \in \{1, -1\}$
gives the representation of $\theta$, and $b$ is the number of bits in the register. The variable
$m \in \{1, 0, -1\}$ selects one of three types of systems: circular, linear and hyperbolic
respectively. In the DCT, $\theta$ is known from the beginning and the operation is always in
the circular system with $m = 1$. The scaling factor
$K = \prod_{i=0}^{b-1} \cos(\delta_i \tan^{-1} 2^{-i})$ is a constant whose limit is
approximately 0.607253. Every angle used in the angle conversion requires a pre-evaluated
arctangent, so these values are implemented as a look-up table in the hardware
system. The circular CORDIC rotation is shown in the figure below.
Fig 2.2: Circular CORDIC rotation
All the trigonometric functions can be evaluated using vector rotations.
The CORDIC algorithm provides an iterative method of performing vector rotations by
arbitrary angles using only shift and add operations. The algorithm is derived from the
general rotation transform:
$$X' = X\cos\theta - Y\sin\theta, \qquad Y' = Y\cos\theta + X\sin\theta \tag{1}$$
where $(X', Y')$ are the coordinates of the vector obtained by rotating a vector with
coordinates $(X, Y)$ through an angle $\theta$ in the rectangular plane. These equations can
be rearranged as:
$$X' = \cos\theta\,(X - Y\tan\theta), \qquad Y' = \cos\theta\,(Y + X\tan\theta) \tag{2}$$
Now, if the rotation angles are restricted such that $\tan\theta = \pm 2^{-i}$, the tangent
multiplication term reduces to a shift operation. Arbitrary angles of rotation can then be
obtained by performing a series of successively smaller elementary rotations. The above
equation for rotation can be expressed as:
$$X_{i+1} = K_i\,(X_i - Y_i\,\delta_i\,2^{-i}), \qquad Y_{i+1} = K_i\,(Y_i + X_i\,\delta_i\,2^{-i}) \tag{3}$$
where $K_i = \cos(\tan^{-1} 2^{-i})$ and $\delta_i = \pm 1$, depending on the earlier iteration.
Removing the scale constant from the equations yields a shift-and-add algorithm for vector
rotation. The product of the $K_i$ approaches the value 0.607253. The CORDIC algorithm in its
binary version can be expressed as a sequence of three equations:
$$X_{k+1} = X_k - m\,Y_k\,\delta_k\,2^{-k}$$
$$Y_{k+1} = Y_k + X_k\,\delta_k\,2^{-k} \tag{4}$$
$$Z_{k+1} = Z_k - \delta_k\,\theta_k$$
where $m = 1$ for the circular system.
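To evaluate the sine and cosine of an angle $|\theta| \le \pi/2$, we let $m = 1$ and
$\theta_k = \tan^{-1}(2^{-k})$, and assume:
$$C = \prod_{k=0}^{n} \cos(\theta_k) \tag{5}$$
Then the equations of the CORDIC algorithm for evaluating the sine and cosine functions
can be written as:
$$X_{k+1} = X_k - Y_k\,\delta_k\,2^{-k}$$
$$Y_{k+1} = Y_k + X_k\,\delta_k\,2^{-k} \tag{6}$$
$$Z_{k+1} = Z_k - \delta_k\,\theta_k$$
where $\delta_k = \pm 1$ is the direction of the $k$-th elementary rotation.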
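The sine and cosine iteration above can be sketched in software as follows. This is a minimal floating-point illustration and not the VHDL design of this thesis: the function name `cordic_sin_cos`, the iteration count `n = 16`, the choice of the rotation direction as the sign of Z, and the start vector (C, 0) that absorbs the scale factor are all assumptions made for the example. The multiplications by `2.0 ** -k` stand in for hardware right shifts.

```python
import math

def cordic_sin_cos(theta, n=16):
    """Rotation-mode CORDIC sketch: returns (cos(theta), sin(theta)) for |theta| <= pi/2."""
    # Elementary angles theta_k = atan(2^-k); in hardware these come from a look-up table.
    angles = [math.atan(2.0 ** -k) for k in range(n)]
    # Accumulated scale factor C = prod cos(theta_k), applied once through the start vector.
    c = 1.0
    for a in angles:
        c *= math.cos(a)
    x, y, z = c, 0.0, theta
    for k in range(n):
        delta = 1.0 if z >= 0 else -1.0      # rotate so that the residual angle z goes to zero
        x, y = x - delta * y * 2.0 ** -k, y + delta * x * 2.0 ** -k
        z -= delta * angles[k]
    return x, y

if __name__ == "__main__":
    cos_t, sin_t = cordic_sin_cos(math.pi / 6)
    print(cos_t, sin_t)
```

Pre-loading the start vector with C means no multiplication is needed after the last iteration; each pass through the loop uses only a comparison, two shift-and-add updates and one table look-up.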
The radix-2 system is used here because it avoids multiplications when
implementing the above equations. Hence a CORDIC iteration can be realized using
shifters and adders only. The figure below shows the structure of a processing element
which implements one CORDIC iteration.
Fig 2.3: Structure of a processing element for one CORDIC iteration
The rotation mode and the vectoring mode are the two operating schemes of the CORDIC
algorithm. In rotation mode, the aim is to rotate the given input vector $(x, y)^T$ by a
given angle. After n iterations, $z_n$ is driven to zero and the total accumulated
rotation angle is equal to the desired angle.
However, the CORDIC iteration is not a perfect rotation. It has been pointed out that
for a fixed-point implementation with a data word length of W bits, no more than W
CORDIC iterations need to be performed. The large number of iterations decreases
the speed in a major way. The computational accuracy of the algorithm also
needs to be taken into account. It is limited by the finite precision of the variables
involved and by rounding errors; moreover, a desired rotation angle can only be
approximated by the n elementary rotation angles, which determines the accuracy
achieved by the CORDIC rotation.
micro-rotations will be removed. This in turn reduces the number of iterations and speeds up
the CORDIC algorithm. To obtain the correct rotated coordinates, the result must also be
multiplied by a scaling factor K. If this scaling factor is removed, the complexity of the
CORDIC algorithm is reduced. In addition, if for a given rotation angle the micro-rotation
angles are changed from $\theta_i = \tan^{-1} 2^{-i}$ to $\theta_i = \sin^{-1} 2^{-i} \approx 2^{-i}$,
no look-up table has to be created and the burden of creating a hard ROM is removed. The
large number of iterations seriously decreases the speed and also consumes considerable
power. Secondly, a scale-factor operation is required to ensure that the final
coordinate $[X_f, Y_f]^T$ has the same norm as the initial coordinate $[X_0, Y_0]^T$.
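The Taylor series up to order three is applied to Equation (1), with corrected
coefficients. The new CORDIC algorithm is summarized below. Equation (8)
describes a rotation with a scaling of an intermediate plane vector
$v_i = [x_i, y_i]^T$ to $v_{i+1} = [x_{i+1}, y_{i+1}]^T$:
$$\begin{aligned}
x_{i+1} &= x_i\cos\theta_i - y_i\sin\theta_i \\
        &= x_i\,(1 - 2^{-1}\theta_i^2) - y_i\,(\theta_i - 2^{-4}\theta_i^3) \\
        &= x_i\,(1 - 2^{-(2i+1)}) - y_i\,(2^{-i} - 2^{-(3i+4)}) \\
y_{i+1} &= y_i\cos\theta_i + x_i\sin\theta_i \\
        &= y_i\,(1 - 2^{-1}\theta_i^2) + x_i\,(\theta_i - 2^{-4}\theta_i^3) \\
        &= y_i\,(1 - 2^{-(2i+1)}) + x_i\,(2^{-i} - 2^{-(3i+4)})
\end{aligned} \tag{8}$$
The angle accumulator is updated as
$$z_{i+1} = z_i - \delta_i\,\theta_i = z_i - \delta_i\,2^{-i}, \qquad \delta_i = \begin{cases} 0, & z_i < \theta_i \\ 1, & z_i \ge \theta_i \end{cases} \qquad i \in \{0, \ldots, n-1\}$$
where $\theta_i = 2^{-i}$, so that
$$z_{i+1} = z_i - \delta_i\,2^{-i} \quad \text{with } i \in \{2, 3, \ldots, n-1\} \tag{9}$$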
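The scaling-free update of equations (8) and (9) can be illustrated with the following sketch. It is written under several assumptions that are not stated in the text: the micro-rotation is selected greedily (the largest $2^{-i}$ with $i \ge 2$ that does not exceed the residual angle), the same $i$ may be used more than once, and the names `scaling_free_rotate`, `i_min`, `i_max` and `max_rot` are hypothetical. The coefficients are taken as printed in equation (8), so the result is only an approximation of the exact rotation.

```python
import math

def scaling_free_rotate(x, y, angle, i_min=2, i_max=15, max_rot=10):
    """Approximate rotation of (x, y) by `angle` (0..45 degrees, in radians) using eq. (8)/(9)."""
    z = angle
    used = []                                   # indices i of the micro-rotations applied
    for _ in range(max_rot):
        # Assumed selection rule: largest micro-angle 2^-i that does not overshoot z.
        i = next((k for k in range(i_min, i_max + 1) if 2.0 ** -k <= z), None)
        if i is None:                           # residual angle below the smallest step
            break
        c = 1.0 - 2.0 ** -(2 * i + 1)           # cos(theta_i) approximation from eq. (8)
        s = 2.0 ** -i - 2.0 ** -(3 * i + 4)     # sin(theta_i) approximation from eq. (8)
        x, y = x * c - y * s, y * c + x * s
        z -= 2.0 ** -i                          # eq. (9)
        used.append(i)
    return x, y, used

if __name__ == "__main__":
    xr, yr, steps = scaling_free_rotate(1.0, 0.0, math.pi / 4)
    print(xr, yr, steps)
```

No look-up table and no final scale multiplication appear in the loop; the price is the approximation error of the truncated series, which is largest for the coarse steps with small $i$.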
For a large target rotation angle, if $\theta_i$ is set too small, the rotation
precision increases, but so does the number of CORDIC iterations required. Hence
the precision of the design needs to be traded off against the hardware limitations and
other performance measures.
The new CORDIC algorithm covers rotation angles from 0 to 45 degrees.
The start-point vector $[x, y]^T$ is $[1, 0]^T$. MATLAB simulations show that the
maximum number of rotations for any rotation angle is ten.
Fig 2.4: Number of iterations for angles from 0° to 45°
However, for the new CORDIC algorithm the input angle is limited to the range 0° to
45°. If a rotation by another angle of less than 90° is wanted, the method of “domain
folding” must be applied. For example, to rotate by an angle $\theta$ between 45° and
90°, we first form the complementary angle $\pi/2 - \theta$; a negative rotation by this
angle must then be performed, so in equations (8) every '-' has to be changed to '+' and
vice versa. After approximating the angle by right shifts, the error between the result
computed by the conventional CORDIC and the new CORDIC algorithm is very small, as
shown in the error plot given below.
Fig 2.5: Error plot between new and conventional CORDIC.
Even though the new algorithm has the advantages of needing no scaling factor and fewer
iterations, it has a disadvantage in word length: a word length of 32 bits must be used,
which is the optimum word length. The design of each iteration of the new CORDIC
algorithm is given below.
Chapter 3
3.1 Different architectures of CORDIC
There are mainly four types of architectures with which the CORDIC IP core can
be designed:
1) Iterative word-serial architecture
2) Parallel-pipelined architecture
3) Iterative bit-serial architecture
4) Iterative bit-parallel architecture
Each architecture is described below.
Depending on the CORDIC mode (rotation or vectoring), the sign-controlling
logic block examines the sign bit (positive or negative) of either RegY or RegA, so that
it can decide which operation (addition or subtraction) needs to be performed in
each iteration. The look-up table holds the pre-computed arctangent values. The
number of entries in the look-up table equals the required number of iterations, n.
The iterative word-serial CORDIC architecture takes n + 1 clock cycles to complete a
single vector coordinate conversion.
3.1.2 Parallel-pipelined architecture:
This architecture represents an unrolled version of the sequential CORDIC algorithm.
Instead of reusing the same hardware for all iteration stages, the parallel architecture
provides a separate processor for every iteration. An example of the parallel CORDIC
architecture for rotation mode is shown in Fig 3.2.
Fig 3.2: Parallel pipelined architecture for CORDIC
Each of the n processors in the block performs a specific iteration, and a
particular processor always performs the same iteration. All the shifters perform a fixed
shift, so they can be implemented by wiring in the FPGA. Every processor uses an individual
arctan value that can also be hardwired to the input of its angle accumulator. No state
machine is needed, which keeps this type of architecture simple.
The parallel architecture is much faster than the sequential architecture described
above under the iterative word-serial architecture. It accepts new input data and delivers
a result at every clock cycle, with a latency of n clock cycles. The parallel-pipelined
architecture is the one used in the design of the 2D DCT, because it provides high
throughput and low power consumption.
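The latency and throughput behaviour described above can be illustrated with a small cycle-by-cycle software model. This is only a sketch under stated assumptions: the stage count `N_STAGES = 12`, the names `stage` and `run_pipeline`, and the use of floating-point values instead of fixed-point registers are all hypothetical choices for the example.

```python
import math

N_STAGES = 12
# Each stage owns one hardwired arctan constant and a fixed shift of k bits.
ATAN = [math.atan(2.0 ** -k) for k in range(N_STAGES)]
GAIN = 1.0
for a in ATAN:
    GAIN *= math.cos(a)

def stage(k, xyz):
    """One CORDIC iteration, always the k-th one for stage k."""
    x, y, z = xyz
    d = 1.0 if z >= 0 else -1.0
    return (x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k, z - d * ATAN[k])

def run_pipeline(angles):
    """Feed one angle per clock; return (cycle, cos, sin) for each result leaving the pipe."""
    regs = [None] * N_STAGES                      # pipeline register after each stage
    results = []
    stream = list(angles) + [None] * N_STAGES     # extra cycles to flush the pipeline
    for cycle, theta in enumerate(stream, start=1):
        new_in = None if theta is None else (GAIN, 0.0, theta)
        prev = [new_in] + regs[:-1]               # what each stage sees on this clock edge
        regs = [None if p is None else stage(k, p) for k, p in enumerate(prev)]
        if regs[-1] is not None:
            x, y, _ = regs[-1]
            results.append((cycle, x, y))
    return results

if __name__ == "__main__":
    for cycle, c, s in run_pipeline([0.1, 0.5, 1.0]):
        print(cycle, c, s)
```

In this model the first result leaves the last stage on cycle `N_STAGES`, and one further result follows on every cycle after that, which is the latency of n cycles and the one-result-per-clock throughput stated in the text.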
Fig 3.3: Iterative bit-serial CORDIC architecture.
The z branch arithmetically combines the register values with the values taken
from a look-up table whose address changes with the iteration number. For n
iterations the output is fed back to the registers; after that the initial values are
loaded again and the final sine value can be read at the output. A simple finite-state
machine (FSM) is needed to control the multiplexers, the shift distance and the addressing
of the constant values.
When implemented in an FPGA, the initial values for the vector coordinates as
well as the constant values in the LUT (look-up table) can be hardwired. The adder and
subtractor components are implemented separately, and a multiplexer controlled
by the sign of the angle accumulator selects between addition and
subtraction by routing the signals as required. The shift operations as implemented
change the shift distance with the iteration number. In addition, the output
rate is reduced by the fact that the operations are performed iteratively; the
maximum output rate therefore equals 1/n times the clock rate.
Of all the CORDIC architectures described above, the one used in this project is
the parallel-pipelined architecture.
Chapter 4
4.1 Overview of DCT
The Discrete Cosine Transform (DCT) is widely used in image processing, especially for
compression. The DCT was first proposed by Ahmed et al. (1974), and it has gained
importance in recent years. Some of the applications of the two-dimensional DCT involve
still-image compression and compression of individual video frames, while the
multidimensional DCT is mainly used for compression of video streams.
2-D DCT algorithms are the most widely used algorithms for image compression, so the main
focus of this chapter is on efficient hardware implementations of 2-D DCT based
image compression: reducing the number of computations, increasing the accuracy of
reconstruction, and reducing the chip area. This in turn reduces the power consumption of
the compression technique. Because the number of applications that require
higher-dimensional DCT algorithms is growing, emphasis is placed on algorithms that are
extensible to higher-dimensional cases. Although JPEG has some very useful strategies for
DCT quantization and compression, it was developed only for low compression ratios. The
8 × 8 DCT block size was chosen for speed, not for performance.
Like other transforms such as the DFT, FFT and DST, the Discrete Cosine Transform
(DCT) attempts to decorrelate the image data. After decorrelation, each transform
coefficient can be encoded independently without losing compression efficiency. This part
describes the DCT and some of its important properties.
$$C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\left[\frac{\pi(2x+1)u}{2N}\right] \tag{1}$$
for $u = 0, 1, 2, \ldots, N-1$, where $\alpha(u) = \sqrt{1/N}$ for $u = 0$ and
$\alpha(u) = \sqrt{2/N}$ otherwise. The inverse transformation is defined as
$$f(x) = \sum_{u=0}^{N-1} \alpha(u)\,C(u) \cos\left[\frac{\pi(2x+1)u}{2N}\right] \tag{2}$$
It is clear from equation (1) that for $u = 0$, $C(u=0) = \sqrt{1/N}\,\sum_{x=0}^{N-1} f(x)$.
Thus, the first transform coefficient is proportional to the average value of the sample
sequence. This value is referred to as the DC coefficient. All other transform coefficients
are called the AC coefficients.
The plot of $\sum_{x=0}^{N-1} \cos\left[\frac{\pi(2x+1)u}{2N}\right]$ for N = 8 and varying
values of u is shown in Figure 1. The top-left waveform (u = 0) gives a constant (DC)
value, whereas all other waveforms (u = 1, 2, ..., 7) have progressively increasing
frequencies. These waveforms are called the cosine basis functions. Note that these basis
functions are orthogonal.
If the input sequence has more than N sample points, it can be divided into
sub-sequences of length N and the DCT can be applied to these parts
independently. Only the values of the function change from one sub-sequence to the next;
the basis functions stay the same. This is an important property, since it shows that the
basis functions can be pre-calculated offline and then multiplied with the sub-sequences.
This reduces the number of mathematical operations and thereby improves computational
efficiency.
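The pre-calculation of the basis functions can be sketched as follows. This is an illustrative example only, assuming the common normalisation $\alpha(0) = \sqrt{1/N}$ and $\alpha(u) = \sqrt{2/N}$ for $u > 0$; the function names `dct_basis`, `dct_1d`, `idct_1d` and `blockwise_dct` are hypothetical.

```python
import math

def alpha(u, n):
    return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

def dct_basis(n):
    """Pre-computed basis: row u holds alpha(u) * cos(pi * (2x + 1) * u / (2n)) for x = 0..n-1."""
    return [[alpha(u, n) * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
             for x in range(n)] for u in range(n)]

def dct_1d(f, basis):
    """Forward 1-D DCT: one dot product with a basis row per coefficient."""
    return [sum(a * v for a, v in zip(row, f)) for row in basis]

def idct_1d(c, basis):
    """Inverse 1-D DCT using the same pre-computed basis."""
    n = len(c)
    return [sum(basis[u][x] * c[u] for u in range(n)) for x in range(n)]

def blockwise_dct(seq, n=8):
    """Split a long sequence into length-n blocks and reuse one basis for all of them."""
    basis = dct_basis(n)                       # computed once, "offline"
    return [dct_1d(seq[i:i + n], basis) for i in range(0, len(seq), n)]

if __name__ == "__main__":
    samples = [23, 44, 34, 33, 34, 29, 35, 23]
    coeffs = dct_1d(samples, dct_basis(8))
    print(coeffs[0], sum(samples) / math.sqrt(8))   # DC coefficient vs. scaled sum
```

The basis table depends only on N, so in hardware it corresponds to a set of constants, while the data-dependent work is a sequence of multiply-accumulate operations per block.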
$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y) \cos\left[\frac{\pi(2x+1)u}{2N}\right] \cos\left[\frac{\pi(2y+1)v}{2N}\right] \tag{4}$$
for $u, v = 0, 1, 2, \ldots, N-1$. The inverse DCT transform is defined as
$$f(x,y) = \sum_{u=0}^{N-1}\sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\,C(u,v) \cos\left[\frac{\pi(2x+1)u}{2N}\right] \cos\left[\frac{\pi(2y+1)v}{2N}\right] \tag{5}$$
for $x, y = 0, 1, 2, \ldots, N-1$. The 2-D basis functions can be generated by multiplying the
horizontally oriented 1-D basis functions with a vertically oriented set of the same
functions. The basis functions for N = 8 are shown below. Again, it can be noted that the
basis functions show a progressive increase in frequency in both the vertical and the
horizontal direction. The top-left basis function results from multiplication of the DC
component with its transpose. Hence, this function assumes a constant value and is referred
to as the DC coefficient.
Fig 4.1. Two-dimensional DCT basis functions for N = 8. Gray represents zero, white
represents positive amplitudes, and black represents negative amplitudes
4.2 Fundamental properties of Discrete Cosine Transform
This section provides some properties of the DCT which are of particular importance to
image processing applications.
4.2.1 Decorrelation
The main advantage of image transformation is the removal of redundancy between
neighboring pixels. This leads to uncorrelated transform coefficients, which can be encoded
independently. The normalized autocorrelation of the images before and after the DCT is
shown in the figure below. Clearly, the amplitude of the autocorrelation after the DCT
operation is very small. Hence, it can be inferred that the DCT exhibits excellent
decorrelation properties.
Fig 4.2. (a) Normalized autocorrelation of uncorrelated image before and after
DCT; (b) Normalized autocorrelation of correlated image before and after DCT.
4.2.2 Energy Compaction
A transformation scheme should be able to pack the input data into as few
coefficients as possible. This allows the quantizer to discard coefficients with relatively
small amplitudes without introducing visual distortion in the reconstructed image. The DCT
exhibits excellent energy compaction for highly correlated images.
Hence, it can be inferred that the DCT provides excellent energy compaction for
correlated images. The energy compaction performance of the DCT approaches optimality as
the image correlation approaches one, and the DCT provides optimal decorrelation for such
images.
4.2.3 Separability
The DCT transform equation can be expressed as
$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \cos\left[\frac{\pi(2x+1)u}{2N}\right] \sum_{y=0}^{N-1} f(x,y) \cos\left[\frac{\pi(2y+1)v}{2N}\right] \tag{6}$$
This property, known as separability, has the main advantage that $C(u,v)$ can
be computed in two steps by successive 1-D operations on the rows and columns
of an image. The same argument applies identically to the inverse DCT
computation. This property is utilized in the hardware design.
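The two-step computation can be sketched as below: a 1-D DCT is applied to every row, the intermediate result is transposed, and the same 1-D DCT is applied again. This is a floating-point illustration, not the hardware design; the function names are hypothetical and the normalisation $\alpha(0) = \sqrt{1/N}$, $\alpha(u) = \sqrt{2/N}$ is assumed. A direct evaluation of the double sum is included only as a reference for comparison.

```python
import math

def dct_1d(f):
    """1-D DCT of a sequence, C(u) = alpha(u) * sum_x f(x) cos(pi (2x+1) u / 2N)."""
    n = len(f)
    out = []
    for u in range(n):
        a = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        out.append(a * sum(f[x] * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                           for x in range(n)))
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d_row_column(block):
    """2-D DCT as two passes of the 1-D DCT with a transposition in between."""
    rows_done = [dct_1d(row) for row in block]                  # transform along y
    cols_done = [dct_1d(col) for col in transpose(rows_done)]   # transform along x
    return transpose(cols_done)                                 # back to [u][v] order

def dct_2d_direct(block):
    """Direct evaluation of the 2-D double sum, used here only as a reference."""
    n = len(block)
    def alpha(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    def basis(x, u):
        return math.cos(math.pi * (2 * x + 1) * u / (2 * n))
    return [[alpha(u) * alpha(v) * sum(block[x][y] * basis(x, u) * basis(y, v)
                                       for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]
```

The row-column form needs 2N one-dimensional transforms of length N instead of one N × N double sum per coefficient, which is why a single 1-D DCT unit plus a transpose buffer is sufficient in hardware.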
4.2.4 Symmetry
The row and column operations in Equation 6 are functionally identical. Such a
transformation is called a symmetric transformation. A separable and symmetric
transform can be expressed in the form
$$T = A f A \tag{7}$$
where A is the N × N transformation matrix and f is the N × N image matrix.
4.2.5 Orthogonality
Chapter 5
Different implementations of
Discrete Cosine Transform
5.1 Different implementations of DCT
There are three categories of approaches to the computation of the 2-D DCT.
The first category of 2-D DCT implementation is indirect computation through other
transforms such as the Discrete Hartley Transform (DHT) and the Discrete Fourier
Transform (DFT). The DHT-based algorithm offers improved throughput,
latency, and turnaround time. The DFT-based approach calculates the odd-length DCT, which
is not applicable to this project since the design must be compatible with JPEG standards.
used in the row-column decomposition, it uses all real arithmetic including eight 1-D DCTs,
and stages of pre-adds and post-adds (a total of 234 additions) to compute the 2-D DCT.
So, the number of multiplications for most implementations should be halved, as
multiplication appears within the 1-D DCT.
Since row-column decomposition is very useful for VLSI implementation, that
implementation is considered in this project. It starts with the one-dimensional
transform.
The theoretical implementation of the 1-D DCT algorithm is a simple transform
given by
Y = AX, where X is a 1-D array of data.
It involves 64 multiplications and 56 additions to compute the entire 1-D DCT,
and it is also very complex to route on an FPGA. The matrix A is given by
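5.1.1 Chen's algorithm
The fast 1-D DCT algorithm that was selected for use in both the direct and the
row-column 2-D approaches was developed by Chen. The 8-point 1-D DCT, written in
matrix-factorized form, is
$$\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}$$
$$\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} \tag{1}$$
where $A = \cos(\pi/4)$, $B = \cos(\pi/8)$, $C = \sin(\pi/8)$, $D = \cos(\pi/16)$,
$E = \cos(3\pi/16)$, $F = \sin(3\pi/16)$ and $G = \sin(\pi/16)$.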
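The even/odd factorization can be checked numerically with the short sketch below. It assumes the 8-point DCT normalisation $X(u) = \tfrac{1}{2} c(u) \sum_n x(n) \cos\frac{(2n+1)u\pi}{16}$ with $c(0) = 1/\sqrt{2}$ and $c(u) = 1$ otherwise; the function names `dct8_even_odd` and `dct8_direct` are hypothetical, and the signs in the two 4 × 4 matrices are the ones that follow from this definition.

```python
import math

A = math.cos(math.pi / 4)
B, C = math.cos(math.pi / 8), math.sin(math.pi / 8)
D, E = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
F, G = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)

EVEN = [[A, A, A, A], [B, C, -C, -B], [A, -A, -A, A], [C, -B, B, -C]]
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]

def dct8_even_odd(x):
    """8-point DCT via two 4x4 matrix products on sums and differences of the inputs."""
    s = [x[i] + x[7 - i] for i in range(4)]      # butterfly sums
    d = [x[i] - x[7 - i] for i in range(4)]      # butterfly differences
    even = [0.5 * sum(m * v for m, v in zip(row, s)) for row in EVEN]   # X0, X2, X4, X6
    odd = [0.5 * sum(m * v for m, v in zip(row, d)) for row in ODD]     # X1, X3, X5, X7
    out = [0.0] * 8
    out[0::2] = even
    out[1::2] = odd
    return out

def dct8_direct(x):
    """Reference: direct evaluation of the assumed 8-point DCT definition."""
    return [0.5 * (math.sqrt(0.5) if u == 0 else 1.0) *
            sum(x[n] * math.cos((2 * n + 1) * u * math.pi / 16) for n in range(8))
            for u in range(8)]
```

The factorized form uses two 4 × 4 products (32 multiplications) instead of one 8 × 8 product (64 multiplications), at the cost of eight extra additions for the butterflies.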
where $c(u) = \frac{1}{\sqrt{2}}$ for $u = 0$ (and 1 otherwise).
$$F(0) = [f(0) + f(7) + f(3) + f(4)] \cos\frac{\pi}{4} + [f(1) + f(6) + f(2) + f(5)] \sin\frac{\pi}{4}$$
$$F(4) = [f(0) + f(7) + f(3) + f(4)] \cos\frac{\pi}{4} - [f(1) + f(6) + f(2) + f(5)] \sin\frac{\pi}{4} \tag{3,4}$$
Therefore, in order to compute both F(0) and F(4), we need one CORDIC
processor. F(2) and F(6) can be obtained by using the rotation mode of CORDIC. For
F(1), F(7), F(5) and F(3), we need four CORDIC processors. So we can use six
CORDIC processors for the 2-D DCT by applying the 1-D DCT two times.
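The reason one rotator suffices for F(0) and F(4) can be seen from the sketch below: both values are the two output coordinates of a single rotation by π/4. The `rotate` function is only a floating-point stand-in for a CORDIC rotator, and the names `rotate` and `f0_f4` are hypothetical. The values are computed as in equations (3, 4), without any additional DCT normalisation constant.

```python
import math

def rotate(x, y, theta):
    """Plain plane rotation, standing in for one CORDIC processor in rotation mode."""
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def f0_f4(f):
    """Return (F(0), F(4)) of an 8-point input from one rotation by pi/4."""
    a = f[0] + f[7] + f[3] + f[4]
    b = f[1] + f[6] + f[2] + f[5]
    x_rot, y_rot = rotate(a, b, math.pi / 4)
    # y_rot = a*sin + b*cos, which equals a*cos + b*sin at pi/4, i.e. F(0);
    # x_rot = a*cos - b*sin, i.e. F(4).
    return y_rot, x_rot

if __name__ == "__main__":
    print(f0_f4([23, 44, 34, 33, 34, 29, 35, 23]))
```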
The number of iterations can be decreased, since the coefficients of the 8×1 DCT are
fixed. The compensation process for the final CORDIC calculation can be built from
adders and shifters, without a multiplier, as expressed below.
$$X_{i+1} = X_i\,(1 + \delta_i \cdot F_i)$$
$$Y_{i+1} = Y_i\,(1 + \delta_i \cdot F_i) \tag{5}$$
where $\delta_i$ has the value $\pm 1$ and $F_i$ stands for a shift operation.
Table 5.3 shows the detailed number of rotations for iterations and compensation
in the six CORDIC processors.
Processor    CORDIC(1)   CORDIC(2)   CORDIC(3)(6)   CORDIC(4)(5)
Angle        π/4         3π/8        7π/16          3π/16
CORDIC iteration [δ, i]
1            [1,2]       [1,2]       [1,3]          [1,2]
2            [1,2]       [1,3]       [1,4]          [1,2]
3            [1,2]       [1,6]       [1,7]          [1,4]
4            [1,5]       [1,9]       [1,10]         [1,6]
5            [1,9]       [1,13]      -              [1,7]
6            [1,10]      -           -              [1,9]
7            -           -           -              [1,10]
Table 5.3: Iterations of the new CORDIC algorithm used for the calculation of the 1-D
DCT.
Chapter 6
6.1 DESIGN OF CORDIC MODULE
The core is built from pipeline stages, each stage representing a single step
of the iteration process. The processor input takes the x and y coordinates of a given
vector as signed values, and the processor core rotates the original vector to align it with
the x-axis. After the input data have been processed, the computed rotation angle and
vector magnitude are provided at the processor output.
The processor has a 16-bit input data width and a 20-bit output data width, and
performs 15 iterations.
6.1.1 PRE-PROCESSOR
To fit the input into the range allowed for core processing, a pre-processor is
required; it detects the quadrant in which the given vector is located and folds the
vector into the range from 0 to 45 degrees.
6.1.2 POST-PROCESSOR
After processing in the core, the results need to be corrected by the post-processor:
the vector length is corrected by multiplying its value by 0.85879, and the angle is
corrected by rotating it back to the corresponding quadrant of the plane.
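The folding performed by the pre-processor and the angle correction performed by the post-processor can be sketched as follows. This is an illustration under assumptions, not the VHDL modules themselves: the names `preprocess`, `core` and `postprocess` are hypothetical, `math.atan2` and `math.hypot` stand in for the pipelined CORDIC core, and the vector-length correction is omitted because this stand-in core has unit gain.

```python
import math

def preprocess(x, y):
    """Fold (x, y) into the first octant (0..45 degrees); return folded vector and flags."""
    neg_x, neg_y = x < 0, y < 0
    ax, ay = abs(x), abs(y)
    swap = ay > ax                    # above 45 degrees: mirror about the diagonal
    if swap:
        ax, ay = ay, ax
    return ax, ay, (neg_x, neg_y, swap)

def core(ax, ay):
    """Stand-in for the CORDIC core in vectoring mode: magnitude and angle in 0..pi/4."""
    return math.hypot(ax, ay), math.atan2(ay, ax)

def postprocess(angle, flags):
    """Undo the folding so that the angle lands in the original quadrant."""
    neg_x, neg_y, swap = flags
    if swap:
        angle = math.pi / 2 - angle
    if neg_x:
        angle = math.pi - angle
    if neg_y:
        angle = -angle
    return angle

def vector_to_polar(x, y):
    ax, ay, flags = preprocess(x, y)
    magnitude, angle = core(ax, ay)
    return magnitude, postprocess(angle, flags)
```

Because the core only ever sees angles between 0 and 45 degrees, its convergence range and internal word growth can be kept small; the three flag bits are all that has to travel alongside the data through the pipeline.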
6.1.3 CORDIC CORE
The CORDIC core performs the CORDIC algorithm on the input data sent by the
pre-processor. It contains several pipelined stages for better performance. Each stage
performs a single iteration step and contains the arctangent table entry for that iteration
and the logic for handling the x, y, z values.
CORDIC PROCESSOR WITH ALL SUB-MODULES
DIFFERENT PIPELINED STAGES INSIDE CORDIC CORE
SINGLE STAGE IMPLEMENTATION OF CORDIC ALGORITHM
DESIGN OF CORDIC MODULE USING MATLAB SIMULINK
System Generator modeling of the above Simulink model
6.2 DESIGN OF DCT CORE
The design flow of the DCT core is shown in the flow chart below.
make data in the form of 32 bits. CORDIC, as virtually any FPGA DSP core does,
utilizes fixed-point arithmetic. In particular, the numbers the core operates on are
represented as two's complement signed fractional numbers. To identify the position of the
binary point separating the integer and fractional portions of the number, the Q format is
commonly used. An mQn format number is an (n + 1)-bit signed two's complement
fixed-point number: a sign bit followed by n significant bits, with the binary point placed
immediately to the right of the m most significant bits. The m MSBs represent the
integer part, and the (n - m) LSBs represent the fractional part of the number, called the
mantissa. Table 6.1 depicts an example of a 1Qn format number.
not have to fit the limited range. To convert floating-point linear input data to the 1Qn
format, follow the simple rule in Eq (1):
$$\text{1Qn fixed-point data} = 2^{n-1} \times \text{floating-point data} \tag{1}$$
Here it is assumed that the floating-point data lie in the range from -1.0
to 1.0. The product on the right-hand side of Eq (1) contains integer and fractional parts.
The fractional part has to be truncated or rounded. shows a few examples of converting
the floating-point numbers to the 1Q15 format.
To convert the 1Qn format back to the floating-point format, use Eq (2):
$$\text{floating-point data} = \text{1Qn fixed-point data} / 2^{n-1} \tag{2}$$
Suppose one of the input numbers is 34. To convert it to the 1Q17 format, we divide 34
by 1000, multiply the result by 2^16 and round it, so that the end result is
round(0.034 × 2^16) = 2228. The angle format for the CORDIC design is the same as
described above.
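The two conversion rules can be written as a pair of helper functions. This is a minimal sketch; the function names `float_to_q` and `q_to_float` are hypothetical, rounding (rather than truncation) is assumed, and no saturation or range checking is performed.

```python
def float_to_q(value, n):
    """1Qn fixed-point value: round(2^(n-1) * value), for value in the range -1.0..1.0."""
    return int(round(value * (1 << (n - 1))))

def q_to_float(fixed, n):
    """Inverse conversion: divide the 1Qn integer by 2^(n-1)."""
    return fixed / (1 << (n - 1))

if __name__ == "__main__":
    q = float_to_q(34 / 1000, 17)     # the worked example: 0.034 in 1Q17
    print(q)                          # prints 2228
    print(q_to_float(q, 17))          # close to 0.034, within one quantization step
```

The round trip is exact only up to the quantization step $2^{-(n-1)}$, so comparing hardware results with floating-point reference values requires a tolerance of that order.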
For the design of the 2D DCT using the new CORDIC algorithm, all the inputs must
be 32 bits wide. All inputs are converted from decimal to fixed-point binary
representation in MATLAB. In this 32-bit design, the 31 least significant bits are used to
represent the decimal fraction, and the most significant bit (MSB) is used as the sign bit.
To check the output data x' and y' at each rotation angle, we first convert the angle from
binary to decimal representation and then divide it by $2^P$, where P is the number of bits
used to represent the decimal fraction. Then we can calculate the values of cos and sin to
estimate the outputs x' and y' respectively. If x' and y' are almost the same as the cosine
and sine values that we calculate, then we can say that the operation of the CORDIC IP is
correct. For example:
$$(\theta)_2 = (00000010001110111110100011010100)_2 \text{ radians}$$
$$(\theta)_{10} = (37480660)_{10} / 2^{P} \text{ radians} = (37480660)_{10} / 2^{31} \approx 0.0175 \text{ radians}$$
The above example verifies that the outputs x' and y' are correct for the given
rotation angle. Several outputs with different input rotation angles are chosen randomly
to verify the correctness by following the steps illustrated in the example. Moreover, the
simulation results are generated as waveforms, so that we can check not only the values of
the outputs but also whether there is any timing mismatch within the overall design.
The design of the 2D DCT using the new CORDIC algorithm is shown below.
Fig 6.7 : FSM for DCT Design using CORDIC algorithm
In the design of the DCT core using the CORDIC algorithm, both the AR CORDIC and
the new CORDIC algorithm use the same state diagram. From the above state
diagram it follows that the DCT architecture is less complex in the CORDIC case: the
conventional architecture consists of multiplications, whereas the CORDIC architecture
consists only of rotations and shifts.
Now consider the design of the matrix transposer cell, which is essentially a kind of
transpose buffer. The need for a real-time implementation of the transposition
operation is felt particularly in image processing applications, as they are dominated by
matrix-based techniques. For example, a wavelet operation on a two-dimensional array of
data is executed as follows: first, the wavelet operation is executed on the rows
(columns) of the data, followed by a transposition operation. This process is then repeated
on the columns (rows) of the data.
The external structure of DCT module is given below:
DIN_0…7 (DATA INPUTS):
Din_0, Din_1, Din_2, Din_3, Din_4, Din_5, Din_6 and Din_7 are 8-bit-wide inputs. The
module reads the data when ND is high.
ND (NEW DATA):
When this input signal is high, it indicates that valid data is available at the input DIN. If
RFD is high, the module reads this data.
RST (Reset):
Reset allows the user to restart the 2-D DCT process.
CLK (Clock):
This clock signal is used to synchronize the module and the data input/output operations.
FINAL DCT:
This signal indicates whether the data at the output port is valid.
RTL SCHEMATIC OF THE ABOVE DCT MODULE
Chapter 7
Simulation Results
7.1 Simulation results of CORDIC
First, the designed CORDIC module is tested with 16-bit data to obtain the
simulation below.
The inputs to the DCT core using the CORDIC algorithm are read from a text file
which contains the data
23 44 34 33 34 29 35 23
22 54 54 33 34 29 35 23
22 64 64 33 34 29 35 33
11 44 64 33 34 29 35 43
45 64 84 33 34 29 35 53
39 76 60 27 33 31 31 52
23 72 85 30 32 34 31 68
41 31 77 79 30 31 28 39
This data is the first 8x8 matrix in the file. The 2D DCT was then computed in
MATLAB, giving the result
Y= [159.5000 2.7683 4.304 -0.2992 0.2400 -0.539 -4.5294 5.6385
7.9473 -0.779 0.547 -4.9323 1.9602 2.9784 -3.7971 3.3222
5.349 -0.274 -1.5518 1.7250 -0.6765 -0.4525 1.8499 -2.2002
1.2619 1.390 1.7039 0.942 -0.706 -1.3686 0.2143 1.1422
-1.4000 -1.497 -0.3431 -1.596 1.700 1.189 -1.4815 -0.6729
0.2107 0.3561 -1.821 -0.1061 -2.0116 0.097 1.6866 0.7552
-2.7028 0.3650 3.0999 1.6355 1.6332 -2.2546 -1.182 -0.911
1.2259 -0.3152 -2.326 -1.5945 -0.8846 1.9693 0.532 0.6540]
The same file is then compiled, synthesized and simulated using Xilinx ISE 9.1i, and
the result is saved to a file directly from Xilinx.
Simulation result of DCT core in Xilinx
So there is a 1% error, which can easily be neglected. Although the CORDIC
algorithm introduces this error, it is compensated by lower power, less
area and a more compact design. This is shown in the results that follow. Now
consider the DCT design using the new CORDIC algorithm: to maintain good precision
in this design, a larger bit width has to be used.
Now consider the synthesis reports created by Xilinx ISE 9.1i. The device
family used for synthesis is tabulated below.
Device Family Virtex2P
Device XC2VP70
Package FF1704
Speed -6
Chapter 8
CONCLUSION
8.1 CONCLUSION
The CORDIC algorithm is a powerful and widely used tool for digital signal processing
applications and can be implemented using PDPs (Programmable Digital Processors).
However, a large amount of data processing is required because of the complex
computations, which affects the cost, speed and flexibility of DSP systems. The
implementation of the DCT using the CORDIC algorithm on an FPGA is therefore the need of
the day, as FPGAs can give enhanced speed at low cost with a lot of flexibility. This is
because a large number of multipliers can be implemented in hardware on an FPGA, whereas
their number is limited in PDPs.
In this thesis the CORDIC module is simulated using Xilinx tools and is
then used for the simulation of the Discrete Cosine Transform. The DCT
processor using the CORDIC module is then implemented on Xilinx hardware. The results are
verified by a test bench generated by the Xilinx simulator. This thesis shows that CORDIC is
available for use in DSP-processor-based computing machines, which are the likely basis
for the next generation of DSP systems. It can be concluded that the designed RTL model for
the CORDIC and DCT functions is accurate and can work for real-time applications.