Circuits and Systems, 2012, 3, 1-9

http://dx.doi.org/10.4236/cs.2012.31001 Published Online January 2012 (http://www.SciRP.org/journal/cs)

System-on-Chip Design Using High-Level Synthesis Tools


Erdal Oruklu1*, Richard Hanley1, Semih Aslan2, Christophe Desmouliers1, Fernando M. Vallina3, Jafar Saniie1
1 Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, USA
2 Ingram School of Engineering, Texas State University, San Marcos, USA
3 Xilinx Inc., San Jose, USA
Email: *[email protected]

Received August 13, 2011; revised November 20, 2011; accepted November 30, 2011

ABSTRACT
This paper addresses the challenges of System-on-Chip designs using High-Level Synthesis (HLS). HLS tools convert algorithms designed in C into hardware modules. This approach is a practical choice for developing complex applications. Nevertheless, certain hardware considerations are required when writing C applications for HLS tools. Hence, in order to demonstrate the fundamental hardware design concepts, a case study is presented. Fast Fourier Transform (FFT) implementation in ANSI C is examined in order to explore the important design issues such as concurrency, data recurrences and memory accesses that need to be resolved before generating the hardware using HLS tools. There are additional language constraints that need to be addressed including use of pointers, recursion and floating point types.

Keywords: System Level Design; High Level Synthesis; Field Programmable Gate Arrays; Fourier Transform

1. Introduction

In the past decade, there has been a substantial increase in the level of hardware abstraction that High-Level Synthesis (HLS) [1-5] tools offer, which has made designing a complete System-on-Chip (SoC) much more practical. By designing at the system level, it has become possible for hardware engineers to avoid gate-level semantics. HLS tools work by taking applications written in a subset of ANSI C and translating them into a Register Transfer Level (RTL) module for Application-Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) chip design. The design workflow requires knowledge of both software, to write C applications, and hardware, to parallelize tasks and resolve timing and memory management issues. There has been significant previous work that discusses how to teach RTL concepts to students and design simple applications for SoCs [6,7]. Nevertheless, the learning curve for software engineers is relatively high since they need to use Hardware Description Languages (HDL) such as Verilog and VHDL. By using HLS tools, software engineers can use their programming skills along with hardware knowledge to create complex embedded hardware/software co-design systems.

To demonstrate the critical hardware and software design issues, a Fast Fourier Transform (FFT) [8] case study is used as a guideline. In order to generate hardware modules satisfying predefined constraints such as throughput and area, different modifications of the code required by HLS tools are presented step by step.

Section 2 of this paper provides a brief background of HLS tools, and the current pedagogical techniques. Section 3 presents an introduction of the FFT algorithm along with a software implementation. This software based FFT is then deconstructed in Section 4, where a fully synthesizable product is created. Section 5 analyzes the different results that can be produced depending on the constraints selected by the user such as speed, area, throughput, and targeted system.

For this paper, all designs are targeted for the Xilinx Virtex-5 FPGA platform [9] using the HLS tool called PICO, provided by Synfora Inc [10] (currently known as Synphony C Compiler by Synopsys [11]). However, the different code modifications presented in this paper are applicable to other HLS tools such as AutoESL [12], which targets primarily Xilinx FPGAs with architecture aware synthesis, and Catapult C [13], which provides full-chip high-level synthesis for both ASIC and FPGA devices and automatic RTL verification.

2. High Level Synthesis Tools

This section outlines the important concepts that software developers need to know before entering the field of HLS. There is a special emphasis on how these concepts differ from the contemporary software environment familiar to the software engineers. Design of SoCs has historically

* Corresponding author.

Copyright © 2012 SciRes. CS


2 E. ORUKLU ET AL.

been accomplished using Hardware Description Languages such as VHDL or Verilog. Each expression in HDL represents a group of gates that operate in parallel, as opposed to machine instructions executed sequentially. This concept of instruction level parallelism is one of the first major hurdles when introducing hardware concepts.

Once an RTL module is designed, it can be compiled and simulated. The simulation is done by creating a series of pre-defined inputs, known as a testbench, and recording the outputs. If a module passes the simulation then a low level implementation can be created. This low level implementation then enters the verification process to ensure that all timing dependencies are met. In practice, simulating and verifying an implementation can take 50% - 60% of the development time, increasing the time-to-market (TTM) [14]. By automating the simulation and verification process, it is possible to greatly reduce the development time.

Integration of HLS tools into the FPGA or ASIC design flow, as shown in Figure 1, allows software designers to build hardware modules and speed up the TTM significantly. During the generation process of an RTL module from a software implementation, simulation and verification are done automatically by using a formal proof provided during the initial steps. Subsequently, by using synthesis tools, the RTL module is implemented and timing verification is done. An independent evaluation of HLS tools for Xilinx FPGAs has been done by Berkeley Design Technology [15]. It shows that using HLS tools with FPGAs can improve the performance of an application by an order of magnitude compared to DSPs. Moreover, this study shows that for a given application, HLS tools will achieve similar results compared to hand-written HDL code with a shorter development time.

The HLS software based approach for simulation and verification is made possible by using SystemC, a language developed by Synopsys, University of California Irvine, Frontier Design and IMEC. SystemC is an extension of C++ that provides additional libraries to design an embedded system. The first version was released in 1999 and in 2005 it was standardized [16,17] as IEEE-1666-2005. These additional libraries make it possible to specify the hardware and software components in an embedded system using one unified paradigm and to generate testbenches.

Focusing further on HLS, the design flow is shown in Figure 2. Each module of a system is implemented using high level languages such as C, C++, Java, or Matlab [2,18], which can then be tested automatically with testbenches provided by the user. After verification of the complete system, the user can specify in the HLS tool which modules will be converted into hardware accelerators in order to speed up the application. This is one of the core elements of hardware/software co-design that software developers need to understand. There are inherent restrictions in the HDLs that are mirrored in the HLS tool. Therefore, the emphasis for teaching HDL to software developers is on its constraints and how they affect the HLS tools.

After generation of the hardware modules along with testbenches, the system is verified and can be implemented using synthesis tools.

This paper, as mentioned earlier, focuses on designing a Fast Fourier Transform. The concept of HLS is presented by using PICO (Program-In Chip-Out) Extreme from Synfora [10,19,20] to generate the RTL code of an FFT. To be specific, PICO takes a C-based description of an algorithm and generates: performance-driven device-dependent synthesizable RTL code, testbench files, application drivers, simulation scripts as well as SystemC based Transaction Level Models (TLM) [3,17,18,21]. PICO design flow is shown in Figure 3. With integration

Figure 1. FPGA high level synthesis block diagram.

Figure 2. High level synthesis (HLS) design flow.


Figure 3. HLS (PICO) based design flow for hardware implementation.

of the PICO design tools to their FPGA flow, designers can create complex hardware [20] sub-systems from sequential untimed C algorithms. It allows designers to explore programmability, performance, power, area and clock frequency. This is achieved by providing a comprehensive and robust verification and validation environment. PICO is designed to explore different types of parallelism and will choose the optimal one transparently. Results in terms of throughput and area are given along with detailed reports that help the user optimize the code. When the synthesized performance is satisfactory, RTL code is generated and can be implemented in the targeted platform. Because the testing is done in C, the verification time of the RTL module can be significantly reduced [20].

3. Fast Fourier Transform

In most cases, the first step when using an HLS tool is to create a reference implementation, which is used to verify the synthesized product. The reference code itself can be compiled using any C compiler, and is purely software based. This means that no new concepts have to be taught, making the reference implementation a logical starting point when using HLS.

When creating the reference code for FFT, there are a few issues that need to be addressed when using HLS tools. The first issue is that arithmetic operations such as division can significantly decrease the performance of the design, and therefore should be avoided whenever possible. Nevertheless, division by a power of two is considered as a bit shift operation and hence can be used at no cost. The second, and more fundamental, issue is that pointers and recursion are not supported by the current HLS tools due to the fact that those concepts are purely software and can't be applied to hardware designs. Finally, HLS tools may not have the capability to synthesize software functions such as cosine and sine. As a result of these constraints, the reference code included in this section does not use divisions, is completely iterative, and has no pointer variables. However, before going into the details of the implementation, the mathematical background of the FFT is presented.

3.1. FFT Algorithm

The Fourier transform takes a signal x in time t and transforms it into a function X in frequency ω:

$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-2j\pi\omega t}\, dt \tag{1}$$

The transform can be computed using a Discrete Fourier Transform (DFT).

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-2j\pi k \frac{n}{N}} \tag{2}$$

where k = 0, ..., N − 1.

The direct realization of the DFT algorithm requires O(N^2) computational time. To make this computation faster, an entire class of Fast Fourier Transforms (FFT) was developed [8]. In this paper a radix-2 FFT decimated in time is implemented. This algorithm divides the original DFT into two DFTs with half the length (i.e. decimation). The first step in decimation is shown below:

$$X_k = \sum_{m=0}^{\frac{N}{2}-1} x_{2m}\, e^{-\frac{2\pi j}{N}(2m)k} + \sum_{m=0}^{\frac{N}{2}-1} x_{2m+1}\, e^{-\frac{2\pi j}{N}(2m+1)k} \tag{3}$$

Then the algorithm is recursively applied to each term until each DFT's length is 1. This recursive deconstruction of the DFT reduces the computational time to O(N log(N)) [8].

3.2. Software Implementation of the FFT

In Figure 4, a 16-point radix-2 FFT is shown. A signal is


inputted into the FFT in a bit reversed order and then goes through log2(N) passes, where each pass has N/2 “butterfly” operations.

Figure 4. 16-point radix-2 FFT.

These butterfly operations take two inputs f and g and produce two outputs F and G, defined as:

$$W_N^k = e^{-2\pi jk/N} \quad \text{(called the twiddle factor)}$$
$$F = f + W_N^k \cdot g$$
$$G = f - W_N^k \cdot g \tag{4}$$

The butterfly operation requires complex number arithmetic additions and multiplications. Because of the programming constraints placed on the reference code, most complex number libraries are not usable. Hence, this reference code uses its own complex number representation shown below:

    typedef struct {
        float x;   /* real part */
        float y;   /* imaginary part */
    } s_complex;

Moreover, in order to perform the butterfly operation, the W_N^k terms need to be calculated. Since we assume that the HLS library does not support cosine and sine functions, the twiddle factors are pre-computed and stored in a table using the code below:

    #include "fft.h"
    #include <math.h>

    #define pi2 (double)6.28318530717958647692528676655901

    extern s_complex fix_float[N / 2];

    void table_setup(void)
    {
        double a = 0.0;
        double e = -pi2 / N;       /* angle step between twiddle factors */
        float cos_val, sin_val;
        int i;
        for (i = 0; i < N / 2; i++) {
            cos_val = cos(a);
            sin_val = sin(a);
            fix_float[i].x = cos_val;
            fix_float[i].y = sin_val;
            a = a + e;
        }
    }

The particular implementation chosen for this reference FFT was provided by [22]. The exact code used is shown in Figure 5. N represents the length of the FFT and must be a power of 2. Before using the function fft_ref, the function table_setup must be executed in order to compute the twiddle factors and store them in the array fix_float. The FFT of an input z can then be executed. The first phase is the bit-reverse operation where the input data are rearranged as shown in Figure 4. Then, for each pass, the butterfly operations are performed until the FFT is completed. In the next section this code will be made fully synthesizable by applying four modifications to it.

4. Code Modification for HLS

The objective of this section is to generate the hardware of an FFT block based on the reference C code using HLS tools. Multiple modifications are needed in order to generate optimal hardware in terms of resource usage and throughput. As an example, we generate an 8-bit 1024-point radix-2 FFT. The output is on 18 bits and will be available in natural order. The size of the data width inside the FFT has been chosen so that the HLS FFT gives the same results as the Xilinx FFT core [23].

4.1. Floating Point to Fixed Point Implementation

Since the reference C code is using floating point numbers, a fixed-point library is needed. For example, PICO, the HLS tool used in this demonstration, provides such a library. The PICO fixed-point arithmetic library derives its semantics from the SystemC fixed-point library and it supports signed and unsigned arithmetic operations. Hence, the previous floating point complex structure must be modified as follows:


[Figure 5 listing: the identifiers were lost in extraction. Annotated stages of the listing, in order: bit-reverse operation; obtain cosine and sine values for the butterfly operation; butterfly calculation.]

Figure 5. FFT reference C code.

    typedef pico::s_fixed<22, 18, pico::S_RND, pico::S_SAT, 0> floatP;

    typedef struct {
        floatP x;
        floatP y;
    } s_complexP;

FFT is computed using 22-bit data width with 18 bits for the integer part and 4 bits for the fractional part. Rounding and saturation configuration is used. The effect of the number of bits allocated to the fractional part on the precision and resource usage of the HLS FFT is presented in Section 5. The twiddle factors are pre-calculated with a precision of 16 bits and stored in an array, eliminating the need for trigonometric functions.

4.2. Input Array to Stream of Input Data

In the reference C code, the input data are passed to the function as an array. This will be translated into memory accesses by the HLS tool, which is not optimal for hardware implementation. Hence, a stream of input data is used. PICO supports two types of streams: external and internal. External streams are used to stream data from/to global memory and/or other blocks in the system. Internal streams are used to stream data between loops within a multi-loop accelerator designed by PICO. In PICO, streams are specified using explicit procedure calls that transmit a scalar value to an output stream or receive a scalar value from an input stream. These procedures are converted into special opcodes that receive (transmit) data from (to) actual streams. For the FFT application, four streams are needed: input/output streams for real and imaginary parts:

    char pico_stream_input_xin();
    char pico_stream_input_yin();
    void pico_stream_output_xout(int);
    void pico_stream_output_yout(int);

PICO synthesizes a FIFO (within the RTL) for each internal and external stream in the code. Different parameters such as the length of the FIFO can be configured using pragmas. The first step of the FFT will be the loading phase where input data are stored into a RAM called z as shown below:

    for (h = 0; h < N; h++) {
        z[h].x = (floatP) pico_stream_input_xin();
        z[h].y = (floatP) pico_stream_input_yin();
    }

Finally, after the FFT is computed, the unloading phase is performed:

    for (p = 0; p < N; p++) {
        pico_stream_output_xout(z[p].x);
        pico_stream_output_yout(z[p].y);
    }

4.3. Bit-Reverse Operation

If we look at the reference C code, the next step would be the bit-reverse stage; this operation takes 1024 cycles. However, it can be integrated in the radix-2 FFT block, hence reducing the total number of cycles required to perform the calculations. This can be done using the bit_swap function:

    unsigned short bit_swap(unsigned short in, unsigned short bits) {
        unsigned short out = 0;
        unsigned short k;
    #pragma unroll
        for (k = 0; k < bits; k++) {
            out = (out << 1) | (in & 0x1);   /* append the lowest bit of in */
            in = in >> 1;
        }
        return out;
    }
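As a minimal sketch of how bit_swap can replace the separate bit-reverse pass, the plain C fragment below runs an 8-point radix-2 decimation-in-time FFT in which every access to z goes through bit_swap. It is an illustration under stated assumptions, not the code of Figure 5 or Figure 6: the names cplx, twiddle and fft_swapped, the length N = 8 and the hard-coded twiddle table are hypothetical choices made for the example, double replaces the fixed-point type, and the unroll pragma is omitted because an ordinary C compiler does not need it.

```c
#include <assert.h>

#define N      8   /* demo FFT length (assumption); must be a power of two */
#define LOG2N  3   /* log2(N) */

typedef struct { double x, y; } cplx;   /* x = real part, y = imaginary part */

/* Twiddle factors W_N^k = exp(-2*pi*j*k/N), k = 0..N/2-1, hard-coded for
 * N = 8 so that the sketch needs no sine or cosine at run time. */
static const cplx twiddle[N / 2] = {
    {  1.0,                  0.0                 },
    {  0.70710678118654752, -0.70710678118654752 },
    {  0.0,                 -1.0                 },
    { -0.70710678118654752, -0.70710678118654752 }
};

/* Reverse the lowest `bits` bits of `in`. */
unsigned short bit_swap(unsigned short in, unsigned short bits)
{
    unsigned short out = 0, k;
    assert(bits <= 16);
    for (k = 0; k < bits; k++) {
        out = (unsigned short)((out << 1) | (in & 0x1));
        in = (unsigned short)(in >> 1);
    }
    return out;
}

/* In-place radix-2 decimation-in-time FFT. The input is stored in natural
 * order; every access is redirected through bit_swap, so no reordering pass
 * runs before the butterflies. On return, bin k of the transform is stored
 * at z[bit_swap(k, LOG2N)]. */
void fft_swapped(cplx z[N])
{
    int n1, n2 = 1, i, j, k;
    for (i = 0; i < LOG2N; i++) {
        n1 = n2;          /* distance between the two butterfly operands */
        n2 = n2 + n2;     /* size of the sub-transforms built in this pass */
        for (j = 0; j < n1; j++) {
            cplx w = twiddle[j * (N / n2)];   /* W_n2^j equals W_N^(j*N/n2) */
            for (k = j; k < N; k += n2) {
                int p = bit_swap((unsigned short)k, LOG2N);
                int q = bit_swap((unsigned short)(k + n1), LOG2N);
                double t1 = w.x * z[q].x - w.y * z[q].y;   /* Re(w * z[q]) */
                double t2 = w.y * z[q].x + w.x * z[q].y;   /* Im(w * z[q]) */
                z[q].x = z[p].x - t1;  z[q].y = z[p].y - t2;
                z[p].x = z[p].x + t1;  z[p].y = z[p].y + t2;
            }
        }
    }
}
```

With the unit impulse x[1] = 1 as input, bin k of the transform equals the twiddle factor W_8^k and is found at z[bit_swap(k, LOG2N)], so an unloading loop that must deliver natural order would apply the same address swap when reading z.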


In this function, we use the pragma unroll to instruct the HLS tool to unroll the loop and hence parallelize the operations to speed up the process. This function is used to calculate the new address when performing the butterfly calculation on z as shown below:

[Code listing: the identifiers were lost in extraction. Annotated stages, in order: bit-reverse operation; butterfly calculation.]

4.4. Memory Access Reduction

Each array of data in the reference C code will be implemented as a RAM by the HLS tool. We can see that multiple accesses of z are done, which is not suitable for hardware implementation since only single or dual port RAMs/ROMs are available. In order to resolve this problem and obtain better performance, the first step is to use temporary variables. This step is shown below:

[Code listing: the identifiers were lost in extraction. Annotated stage: butterfly calculation.]

Through this arrangement, the memory accesses are reduced to 2 read and 2 write operations. They can be reduced further using multi-buffering or ping-pong memories. Therefore, we use two RAMs z and z1 and we alternate read and write operations. For example, a read operation will be done on z (or z1) while the write operation will be done on z1 (or z):

[Code listing: the identifiers were lost in extraction. Annotated stages, in order: multi-buffering; butterfly calculation; multi-buffering.]

By integrating the modifications presented in this section to the reference C code given in Figure 5, the HLS implementation of the FFT can be obtained as shown in Figure 6.

[Figure 6 listing: the identifiers were lost in extraction. Annotated stages, in order: multi-buffering; butterfly calculation; multi-buffering.]

Figure 6. HLS implementation of FFT.

5. Hardware Synthesis Results

The HLS tool offers different configurations that will have an impact on the hardware generated. For example, the user can specify the desired frequency that may or may not be achieved by the tool depending on the system targeted and the complexity of the C code. As seen in


this section, increasing the frequency will increase the resources of the hardware generated by the HLS tool. The throughput (number of FFTs that can be done in one second) can also be specified. In order to achieve a high throughput, the HLS tool will parallelize tasks, hence increasing the hardware resources. Finally, the user can specify to implement arrays using block RAMs or look-up tables (LUTs). Hardware implementation results are obtained using Xilinx ISE 12.1 software with either speed or area optimization for Virtex-5 FPGA. The twiddle factors have been implemented using LUTs but can also be implemented using RAMs. Doing so reduces the total number of slice LUTs but increases the number of block RAM/FIFOs. Table 1 shows the hardware usage of the HLS implementation of FFT with 22-bit data width for different targeted frequencies.

One can see a significant increase in terms of logic slices for 150 MHz operational frequency. This is due to the fact that we have selected optimization for speed in ISE in order to achieve the desired operational frequency after place and route. For frequencies lower than 150 MHz, optimization in terms of area has been selected. For frequencies from 50 MHz to 150 MHz, the total number of clock cycles achieved by PICO to perform the 1024-point FFT is 7168 but for 175 MHz it is increased to 12288 clock cycles. 7168 clock cycles is the minimum latency that can be obtained and is calculated as follows:

$$\text{latency} = \text{loading} + \text{FFT} + \text{unloading}$$
$$\text{latency} = N + \frac{N}{2}\log_2(N) + N \tag{5}$$
$$\text{latency} = 1024 + 512 \times 10 + 1024 = 7168 \text{ clock cycles}$$

For frequencies higher than 150 MHz, PICO reduces the task parallelism of the FFT in order to achieve the desired frequency. This results in an increase of the latency and a reduction of the hardware resources. The maximum frequency that can be obtained by PICO is around 270 MHz with a total of 17,408 clock cycles (1024 + 3 × 512 × 10 + 1024) to compute the FFT. Nevertheless, after place and route, the maximum frequency obtained is 180 MHz due to the FPGA targeted.

Area reduction in terms of slices and DSP48E blocks can be achieved by increasing the number of clock cycles required to perform the FFT. Hence, for equivalent throughput, it is better to choose a higher operational frequency and a higher number of clock cycles required to perform the FFT. Table 2 shows the hardware usage of the FFT for a targeted frequency of 150 MHz with different throughputs. For example, from Table 1, for a frequency of 75 MHz, the throughput is 10,463. Nevertheless, with a frequency of 150 MHz, a better throughput can be obtained using fewer DSP48E blocks (see Table 2, second row).

Figure 7 shows the error variation with respect to the width of the fractional part compared to the reference code shown in Figure 5. The relative error for the FFT is given using the formula below:

$$\text{error} = \frac{1}{100}\,\frac{1}{1024}\sum_{n=0}^{99}\sum_{k=0}^{1023}\left(\frac{X_{ref}[n][k]-X_{HLS}[n][k]}{X_{ref}[n][k]}+\frac{Y_{ref}[n][k]-Y_{HLS}[n][k]}{Y_{ref}[n][k]}\right) \tag{6}$$

where X and Y are real and imaginary parts respectively. The relative error is calculated for 100 random input signals of 1024 samples each. Figure 7 shows that the relative error decreases linearly as the number of bits for the fractional part increases. For the implementation of the FFT, –40 dB is achieved, giving the same results as the Xilinx FFT core. Nevertheless, the user can increase the precision at the expense of hardware usage. For 13 bits, the relative error achieved is –73 dB compared to the reference C code based on double precision floating point operations.

Table 3 shows the hardware usage with respect to the width of the fractional part for a desired operational frequency of 100 MHz. As expected, the resource usage increases with the number of bits for the fractional part. Nevertheless, the number of block RAM/FIFOs used is the same. This is due to the architecture of the Virtex-5 FPGA selected.
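To make the role of the fractional width concrete, the following plain C sketch models a signed fixed-point conversion with round-to-nearest and saturation. It is only a model under stated assumptions and not the PICO or SystemC fixed-point library: the function names to_fixed and from_fixed are hypothetical, ties are assumed to round toward plus infinity, and the raw value is held in an ordinary long.

```c
#include <assert.h>

/* Convert v to a signed fixed-point integer with `frac` fractional bits and
 * `width` total bits: round to nearest (ties toward +infinity, an assumption
 * about the rounding mode), then saturate to the representable range. */
long to_fixed(double v, int frac, int width)
{
    long max, min, r;
    double s;
    assert(width >= 2 && width <= 31 && frac >= 0 && frac < width);
    max = (1L << (width - 1)) - 1;        /* largest raw value  */
    min = -(1L << (width - 1));           /* smallest raw value */
    s = v * (double)(1L << frac) + 0.5;   /* scale, then bias for rounding */
    if (s >= (double)max + 1.0) return max;   /* saturate high */
    if (s < (double)min) return min;          /* saturate low  */
    r = (long)s;                          /* truncates toward zero */
    if ((double)r > s) r -= 1;            /* turn truncation into floor */
    return r;
}

/* Value represented by the raw integer r with `frac` fractional bits. */
double from_fixed(long r, int frac)
{
    return (double)r / (double)(1L << frac);
}
```

For a 22-bit word, to_fixed(0.3, 4, 22) gives the raw value 5, which from_fixed maps back to 0.3125, while 13 fractional bits give 2458/8192. In this model the rounding error of an unsaturated value is bounded by half of the least significant bit, which is why each extra fractional bit buys precision at the cost of wider arithmetic.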

Table 1. FFT hardware usage for different frequencies.

Targeted     Slices      Slices   Block      DSP48E   Achieved
frequency    Registers   LUTs     RAM/FIFO            frequency
50 MHz       749         1700     2          4        50 MHz
75 MHz       765         1769     2          4        75 MHz
100 MHz      926         1967     2          4        100 MHz
125 MHz      1042        1714     2          4        125 MHz
150 MHz      1546        2004     2          4        150 MHz
175 MHz      1380        1849     2          2        165 MHz
270 MHz      1457        1989     2          2        180 MHz
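The latency and throughput figures quoted in this section can be checked with a few lines of integer arithmetic. The sketch below is a worked example rather than part of the design: the function names and the parameter cycles_per_butterfly are hypothetical. The text states the factor 3 only for the 270 MHz target; reading 12288 cycles as a factor of 2, and the four rows of Table 2 as factors 1 to 4 at 150 MHz, is an inference from the numbers and not a statement from the tool reports.

```c
#include <assert.h>

/* Integer log2 of a power of two. */
static unsigned ilog2(unsigned long n)
{
    unsigned r = 0;
    assert(n != 0 && (n & (n - 1)) == 0);
    while (n > 1) { n >>= 1; r++; }
    return r;
}

/* Clock cycles for one n-point FFT: loading + butterfly passes + unloading.
 * cycles_per_butterfly = 1 reproduces Equation (5). */
unsigned long fft_latency(unsigned long n, unsigned cycles_per_butterfly)
{
    return n + cycles_per_butterfly * (n / 2) * ilog2(n) + n;
}

/* Whole FFTs completed per second at clock frequency f_hz. */
unsigned long fft_throughput(unsigned long f_hz, unsigned long latency)
{
    assert(latency != 0);
    return f_hz / latency;
}
```

For N = 1024 this gives 7168, 12288 and 17408 cycles for factors 1, 2 and 3. Dividing 75 MHz by 7168 cycles yields the 10,463 FFTs per second quoted from Table 1, and dividing 150 MHz by 7168, 12288, 17408 and 22528 cycles yields 20926, 12207, 8616 and 6658, the throughputs listed in Table 2.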


Table 2. FFT hardware usage for different throughputs.

Targeted     Slices      Slices   Block      DSP48E
throughput   Registers   LUTs     RAM/FIFO
20926        1546        2004     2          4
12207        1351        1693     2          2
8616         1186        1418     2          2
6658         1161        1404     2          1

Figure 7. Relative error for different bit sizes for the fractional part. (Plot: relative error in dB, from –10 to –80, versus number of bits for the fractional part, from 0 to 14.)

Table 3. FFT hardware usage for different fractional sizes.

Fractional part   Slices      Slices   Block      DSP48E
number of bits    Registers   LUTs     RAM/FIFO
0                 767         1305     2          4
2                 809         1474     2          4
4                 926         1967     2          4
6                 968         2184     2          4
8                 1188        2309     2          4
10                1279        2405     2          8
13                1400        2607     2          8

6. Conclusion

In this paper, we have presented hardware considerations that software engineers need to apply when designing hardware modules using HLS tools. As a demonstration, the implementation of a radix-2 FFT unit has been presented. We have shown the different steps to achieve an optimized C code for HLS tools based on an ANSI C code. Results of the generated FFT for a Virtex-5 FPGA have been presented. FFT has a broad range of applications in digital signal processing and multimedia. It is a key component that determines most of the design metrics in many signal processing communication applications. HLS tools facilitate complex algorithms to be realized at a higher level. They can reduce the design cycle significantly while successfully generating results very close to handmade HDL design.

7. Acknowledgements

The authors would like to thank Xilinx, Inc. (www.xilinx.com) and Synopsys (www.synopsys.com) for their valuable support.

REFERENCES

[1] S. Dongwan, A. Gerstlauer, R. Domer and D. D. Gajski, “An Interactive Design Environment for C-Based High-Level Synthesis of RTL Processors,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 16, No. 4, 2008, pp. 446-475.
[2] S. Ramachandran, “Digital VLSI System Design,” Chapter 11, Springer, New York, 2007.
[3] M. Glasser, “Open Verification Methodology Cookbook,” Chapters 1-3, Springer, New York, 2009. doi:10.1007/978-1-4419-0968-8
[4] E. Casseau and B. Le Gal, “High-Level Synthesis for the Design of FPGA-Based Signal Processing Systems,” International Symposium on Systems, Architectures, Modeling, and Simulation, SAMOS’09, 20-23 July 2009, pp. 25-32.
[5] B. Bailey, G. Martin and A. Piziali, “ESL Design and Verification,” Morgan Kaufmann, San Francisco, Chapters 1-6, 2007.
[6] V. Sklyarov and I. Skliarova, “Teaching Reconfigurable Systems: Methods, Tools, Tutorials, and Projects,” IEEE Transactions on Education, Vol. 48, No. 2, 2005, pp. 290-300. doi:10.1109/TE.2004.842909
[7] L. E. M. Brackenbury, L. A. Plana and J. Pepper, “System-on-Chip Design and Implementation,” IEEE Transactions on Education, Vol. 53, No. 2, 2010, pp. 272-281. doi:10.1109/TE.2009.2014858
[8] J. G. Proakis and D. G. Manolakis, “Digital Signal Processing,” 4th Edition, Prentice Hall, New Jersey, 2006.
[9] Xilinx Inc., “XUPV5 Development Board,” http://www.xilinx.com
[10] Synfora Inc, Website. http://www.synfora.com
[11] Synopsys, “Synphony C Compiler,” http://www.synopsys.com/Tools/SLD/HLS/Pages/SynphonyC-Compiler.aspx
[12] AutoESL, Website. http://www.autoesl.com
[13] Mentor, Website. http://www.mentor.com
[14] P. Avss, S. Prasant and R. Jain, “Virtual Prototyping Increases Productivity - A Case Study,” IEEE International Symposium on VLSI Design, Automation and Test, Hsinchu, 28-30 April 2009, pp. 96-101. doi:10.1109/VDAT.2009.5158104
[15] Berkeley Design Technology, “An Independent Evaluation of High-Level Synthesis Tools for Xilinx FPGAs,” http://www.bdti.com
[16] K. L. Man, “An Overview of SystemCFL,” Research in Microelectronics and Electronics, 2005 PhD, Vol. 1, 2005, pp. 145-148.
[17] P. Schumacher, M. Mattavelli, A. Chirila-Rus and R. Turney, “A Software/Hardware Platform for Rapid Prototyping of Video and Multimedia Designs,” Proceedings of Fifth International Workshop on System-on-Chip for Real-Time Applications, 20-24 July 2005, pp. 30-33. doi:10.1109/IWSOC.2005.27
[18] W. Chen (Ed.), “The VLSI Handbook,” 2nd Edition, Chapter 86, CRC Press LLC, Boca Raton, 2007.
[19] S. Van Haastregt and B. Kienhuis, “Automated Synthesis of Streaming C Applications to Process Networks in Hardware,” Proceedings of the Conference on Design Automation & Test in Europe, April 2009, pp. 890-893.
[20] P. Coussy and A. Morawiec, “High-Level Synthesis: From Algorithm to Digital Circuits,” Springer Science + Business Media, Chapters 1, 4, Berlin, 2008.
[21] N. Hatami, A. Ghofrani, P. Prinetto and Z. Navabi, “TLM 2.0 Simple Sockets Synthesis to RTL,” International Conference on Design & Technology of Integrated Systems in Nanoscale Era, Vol. 1, 2000, pp. 232-235.
[22] D. L. Jones, “FFT Reference C Code,” University of Illinois at Urbana-Champaign, 1992.
[23] Xilinx Inc., “CoreGen,” http://www.xilinx.com
