System-On-Chip Design Using High-Level Synthesis Tools
Received August 13, 2011; revised November 20, 2011; accepted November 30, 2011
ABSTRACT
This paper addresses the challenges of System-on-Chip designs using High-Level Synthesis (HLS). HLS tools convert algorithms designed in C into hardware modules. This approach is a practical choice for developing complex applications. Nevertheless, certain hardware considerations are required when writing C applications for HLS tools. Hence, in order to demonstrate the fundamental hardware design concepts, a case study is presented. A Fast Fourier Transform (FFT) implementation in ANSI C is examined in order to explore the important design issues such as concurrency, data recurrences and memory accesses that need to be resolved before generating the hardware using HLS tools. There are additional language constraints that need to be addressed, including the use of pointers, recursion and floating point types.
Keywords: System Level Design; High Level Synthesis; Field Programmable Gate Arrays; Fourier Transform
been accomplished using Hardware Description Languages (HDL) such as VHDL or Verilog. Each expression in HDL represents a group of gates that operate in parallel, as opposed to machine instructions executed sequentially. This concept of instruction level parallelism is one of the first major hurdles when introducing hardware concepts.

Once an RTL module is designed, it can be compiled and simulated. The simulation is done by creating a series of pre-defined inputs, known as a testbench, and recording the outputs. If a module passes the simulation, then a low level implementation can be created. This low level implementation then enters the verification process to ensure that all timing dependencies are met. In practice, simulating and verifying an implementation can take 50% - 60% of the development time, increasing the time-to-market (TTM) [14]. By automating the simulation and verification process, it is possible to greatly reduce the development time.

Integration of HLS tools into the FPGA or ASIC design flow, as shown in Figure 1, allows software designers to build hardware modules and shorten the TTM significantly. During the generation process of an RTL module from a software implementation, simulation and verification are done automatically by using a formal proof provided during the initial steps. Subsequently, by using synthesis tools, the RTL module is implemented and timing verification is done. An independent evaluation of HLS tools for Xilinx FPGAs has been done by Berkeley Design Technology [15]. It shows that using HLS tools with FPGAs can improve the performance of an application by an order of magnitude compared to DSPs. Moreover, this study shows that for a given application, HLS tools achieve results similar to hand-written HDL code with a shorter development time.

The software based approach of HLS to simulation and verification is made possible by SystemC, a language developed by Synopsys, University of California Irvine, Frontier Design and IMEC. SystemC is an extension of C++ that provides additional libraries to design an embedded system. The first version was released in 1999, and in 2005 it was standardized as IEEE-1666-2005 [16,17]. These additional libraries make it possible to specify the hardware and software components in an embedded system using one unified paradigm and to generate testbenches.

Figure 1. FPGA high level synthesis block diagram.

Focusing further on HLS, the design flow is shown in Figure 2. Each module of a system is implemented using high level languages such as C, C++, Java, or Matlab [2,18], which can then be tested automatically with testbenches provided by the user. After verification of the complete system, the user can specify in the HLS tool which modules will be converted into hardware accelerators in order to speed up the application. This is one of the core elements of hardware/software co-design that software developers need to understand. There are inherent restrictions in the HDLs that are mirrored in the HLS tool. Therefore, the emphasis when teaching HDL to software developers is on its constraints and how they affect the HLS tools.

After generation of the hardware modules along with testbenches, the system is verified and can be implemented using synthesis tools.

Figure 2. High level synthesis (HLS) design flow.

This paper, as mentioned earlier, focuses on designing a Fast Fourier Transform. The concept of HLS is presented by using PICO (Program-In Chip-Out) Extreme from Synfora [10,19,20] to generate the RTL code of an FFT. To be specific, PICO takes a C-based description of an algorithm and generates performance-driven, device-dependent synthesizable RTL code, testbench files, application drivers, simulation scripts, as well as SystemC based Transaction Level Models (TLM) [3,17,18,21]. The PICO design flow is shown in Figure 3. With integration
of the PICO design tools to their FPGA flow, designers can create complex hardware sub-systems [20] from sequential untimed C algorithms. It allows designers to explore programmability, performance, power, area and clock frequency. This is achieved by providing a comprehensive and robust verification and validation environment. PICO is designed to explore different types of parallelism and will choose the optimal one transparently. Results in terms of throughput and area are given along with detailed reports that help the user optimize the code. When the synthesized performance is satisfactory, RTL code is generated and can be implemented in the targeted platform. Because the testing is done in C, the verification time of the RTL module can be significantly reduced [20].

3. Fast Fourier Transform

In most cases, the first step when using an HLS tool is to create a reference implementation, which is used to verify the synthesized product. The reference code itself can be compiled using any C compiler, and is purely software based. This means that no new concepts have to be taught, making the reference implementation a logical starting point when using HLS.

When creating the reference code for the FFT, there are a few issues that need to be addressed when using HLS tools. The first issue is that arithmetic operations such as division can significantly decrease the performance of the design, and therefore should be avoided whenever possible. Nevertheless, division by a power of two is treated as a bit shift operation and hence can be used at no cost. The second, and more fundamental, issue is that pointers and recursion are not supported by the current HLS tools because those concepts are purely software constructs and cannot be applied to hardware designs. Finally, HLS tools may not have the capability to synthesize software functions such as cosine and sine. As a result of these constraints, the reference code included in this section does not use divisions, is completely iterative, and has no pointer variables. However, before going into the details of the implementation, the mathematical background of the FFT is presented.

3.1. FFT Algorithm

The Fourier transform takes a signal x in time t and transforms it into a function X in frequency ω:

$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-2 j \pi \omega t}\, dt \tag{1}$$

The transform can be computed using a Discrete Fourier Transform (DFT).

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-2 j \pi k \frac{n}{N}} \tag{2}$$

where $k = 0, \ldots, N-1$.

The direct realization of the DFT algorithm requires $O(N^2)$ computational time. To make this computation faster, an entire class of Fast Fourier Transforms (FFT) was developed [8]. In this paper, a radix-2 decimation-in-time FFT is implemented. This algorithm divides the original DFT into two DFTs with half the length (i.e. decimation). The first step in decimation is shown below:

$$X_k = \sum_{m=0}^{\frac{N}{2}-1} x_{2m}\, e^{-\frac{2\pi j}{N}(2m)k} + \sum_{m=0}^{\frac{N}{2}-1} x_{2m+1}\, e^{-\frac{2\pi j}{N}(2m+1)k} \tag{3}$$

Then the algorithm is recursively applied to each term until each DFT's length is 1. This recursive decomposition of the DFT reduces the computational time to $O(N \log N)$ [8].

3.2. Software Implementation of the FFT

In Figure 4, a 16-point radix-2 FFT is shown. A signal is
fed into the FFT in bit-reversed order and then goes through log2(N) passes, where each pass has N/2 "butterfly" operations.

Figure 4. 16-point radix-2 FFT.

These butterfly operations, with inputs f and g and outputs F and G, are defined as:

$$W_N^k = e^{-2\pi j k / N} \quad \text{(called the twiddle factor)}$$

$$F = f + W_N^k\, g, \qquad G = f - W_N^k\, g \tag{4}$$

The butterfly operation requires complex number additions and multiplications. Because of the programming constraints placed on the reference code, most complex number libraries are not usable. Hence, this reference code uses its own complex number representation shown below:

    typedef struct {
        float x;   /* real part */
        float y;   /* imaginary part */
    } s_complex;

Moreover, in order to perform the butterfly operation, the $W_N^k$ terms need to be calculated. Since we assume that the HLS library does not support cosine and sine functions, the twiddle factors are pre-computed and stored in a table using the code below:

    #include "fft.h"
    #include <math.h>

    #define pi2 (double)6.28318530717958647692528676655901

    extern s_complex fix_float[N / 2];

    void table_setup(void)
    {
        double a = 0.0;        /* current angle */
        double e = pi2 / N;    /* angle increment */
        float cos_val, sin_val;
        int i;

        for (i = 0; i < N / 2; i++) {
            cos_val = cos(a);
            sin_val = sin(a);
            fix_float[i].x = cos_val;
            fix_float[i].y = sin_val;
            a = a + e;
        }
    }

The particular implementation chosen for this reference FFT was provided by [22]. The exact code used is shown in Figure 5. N represents the length of the FFT and must be a power of 2. Before using the function fft_ref, the function table_setup must be executed in order to compute the twiddle factors and store them in the array fix_float. The FFT of an input z can then be executed. The first phase is the bit-reverse operation, where the input data are rearranged as shown in Figure 4. Then, for each pass, the butterfly operations are performed until the FFT is completed. In the next section this code will be made fully synthesizable by applying four modifications to it.

4. Code Modification for HLS

The objective of this section is to generate the hardware of an FFT block based on the reference C code using HLS tools. Multiple modifications are needed in order to generate optimal hardware in terms of resource usage and throughput. As an example, we generate an 8-bit 1024-point radix-2 FFT. The output is 18 bits wide and will be available in natural order. The data width inside the FFT has been chosen so that the HLS FFT gives the same results as the Xilinx FFT core [23].

4.1. Floating Point to Fixed Point Implementation

Since the reference C code uses floating point numbers, a fixed-point library is needed. For example, PICO, the HLS tool used in this demonstration, provides such a library. The PICO fixed-point arithmetic library derives its semantics from the SystemC fixed-point library and it supports signed and unsigned arithmetic operations. Hence, the previous floating point complex structure must be modified as follows:
    typedef pico::s_fixed<22, 18, pico::S_RND, pico::S_SAT, 0> floatP;

    typedef struct {
        floatP x;
        floatP y;
    } s_complexP;

The FFT is computed using a 22-bit data width with 18 bits for the integer part and 4 bits for the fractional part. The rounding and saturation configuration is used. The effect of the number of bits allocated to the fractional part on the precision and resource usage of the HLS FFT is presented in Section 5. The twiddle factors are pre-calculated with a precision of 16 bits and stored in an array, eliminating the need for trigonometric functions.

Figure 5. FFT reference C code, annotated with three stages: the bit-reverse operation, obtaining the cosine and sine values for the butterfly operation, and the butterfly calculation.

4.2. Input Array to Stream of Input Data

In the reference C code, the input data are passed to the function as an array. The HLS tool translates this into memory accesses, which is not optimal for a hardware implementation. Hence, a stream of input data is used. PICO supports two types of streams: external and internal. External streams are used to stream data from/to global memory and/or other blocks in the system. Internal streams are used to stream data between loops within a multi-loop accelerator designed by PICO. In PICO, streams are specified using explicit procedure calls that transmit a scalar value to an output stream or receive a scalar value from an input stream. These procedures are converted into special opcodes that receive (transmit) data from (to) actual streams. For the FFT application, four streams are needed, namely input and output streams for the real and imaginary parts:

    char pico_stream_input_xin();
    char pico_stream_input_yin();
    void pico_stream_output_xout(int);
    void pico_stream_output_yout(int);

PICO synthesizes a FIFO (within the RTL) for each internal and external stream in the code. Different parameters such as the length of the FIFO can be configured using pragmas. The first step of the FFT is the loading phase, where input data are stored into a RAM called z as shown below:

    for (h = 0; h < N; h++) {
        z[h].x = (floatP) pico_stream_input_xin();
        z[h].y = (floatP) pico_stream_input_yin();
    }

Finally, after the FFT is computed, the unloading phase is performed:

    for (p = 0; p < N; p++) {
        pico_stream_output_xout(z[p].x);
        pico_stream_output_yout(z[p].y);
    }

4.3. Bit-Reverse Operation

If we look at the reference C code, the next step would be the bit-reverse stage; this operation takes 1024 cycles. However, it can be integrated in the radix-2 FFT block, hence reducing the total number of cycles required to perform the calculations. This can be done using the bit_swap function:

    unsigned short bit_swap(unsigned short in, unsigned short bits) {
        unsigned short out = 0;
        unsigned short k;
    #pragma unroll
        for (k = 0; k < bits; k++) {
            out = (out << 1) | (in & 0x1);   /* shift in the lowest bit of in */
            in = in >> 1;
        }
        return out;
    }
this section, increasing the frequency will increase the resources of the hardware generated by the HLS tool. The throughput (number of FFTs that can be done in one second) can also be specified. In order to achieve a high throughput, the HLS tool will parallelize tasks, hence increasing the hardware resources. Finally, the user can specify whether arrays are implemented using block RAMs or look-up tables (LUTs). Hardware implementation results are obtained using Xilinx ISE 12.1 software with either speed or area optimization for a Virtex-5 FPGA. The twiddle factors have been implemented using LUTs but can also be implemented using RAMs. Doing so reduces the total number of slice LUTs but increases the number of block RAM/FIFOs. Table 1 shows the hardware usage of the HLS implementation of the FFT with 22-bit data width for different targeted frequencies.

One can see a significant increase in terms of logic slices for the 150 MHz operational frequency. This is due to the fact that we have selected optimization for speed in ISE in order to achieve the desired operational frequency after place and route. For frequencies lower than 150 MHz, optimization in terms of area has been selected. For frequencies from 50 MHz to 150 MHz, the total number of clock cycles achieved by PICO to perform the 1024-point FFT is 7168, but for 175 MHz it is increased to 12288 clock cycles. 7168 clock cycles is the minimum latency that can be obtained and is calculated as follows:

$$\text{latency} = \text{loading} + \text{FFT} + \text{unloading}$$

$$\text{latency} = N + \frac{N}{2}\log_2 N + N \tag{5}$$

$$\text{latency} = 1024 + 512 \times 10 + 1024 = 7168 \text{ clock cycles}$$

For frequencies higher than 150 MHz, PICO reduces the task parallelism of the FFT in order to achieve the desired frequency. This results in an increase of the latency and a reduction of the hardware resources. The maximum frequency that can be obtained by PICO is around 270 MHz with a total of 17,408 clock cycles ($1024 + 3 \times 512 \times 10 + 1024$) to compute the FFT. Nevertheless, after place and route, the maximum frequency obtained is 180 MHz due to the FPGA targeted.

Area reduction in terms of slices and DSP48E blocks can be achieved by increasing the number of clock cycles required to perform the FFT. Hence, for equivalent throughput, it is better to choose a higher operational frequency and a higher number of clock cycles required to perform the FFT. Table 2 shows the hardware usage of the FFT for a targeted frequency of 150 MHz with different throughputs. For example, from Table 1, for a frequency of 75 MHz, the throughput is 10,463 FFTs per second. Nevertheless, with a frequency of 150 MHz, a better throughput can be obtained using fewer DSP48E blocks (see Table 2, second row).

Figure 7 shows the error variation with respect to the width of the fractional part compared to the reference code shown in Figure 5. The relative error for the FFT is given using the formula below:

$$\text{error} = \frac{1}{100}\,\frac{1}{1024} \sum_{n=0}^{99} \sum_{k=0}^{1023} \left( \frac{\left| X_{ref}[n][k] - X_{HLS}[n][k] \right|}{\left| X_{ref}[n][k] \right|} + \frac{\left| Y_{ref}[n][k] - Y_{HLS}[n][k] \right|}{\left| Y_{ref}[n][k] \right|} \right) \tag{6}$$

where X and Y are the real and imaginary parts respectively. The relative error is calculated for 100 random input signals of 1024 samples each. Figure 7 shows that the relative error decreases linearly as the number of bits for the fractional part increases. For the implementation of the FFT, -40 dB is achieved, giving the same results as the Xilinx FFT core. Nevertheless, the user can increase the precision at the expense of hardware usage. For 13 bits, the relative error achieved is -73 dB compared to the reference C code based on double precision floating point operations.

Table 3 shows the hardware usage with respect to the width of the fractional part for a desired operational frequency of 100 MHz. As expected, the resource usage increases with the number of bits for the fractional part. Nevertheless, the number of block RAM/FIFOs used is the same. This is due to the architecture of the Virtex-5 FPGA selected.
                          Resource usage
Targeted     Slice       Slice    Block      DSP48E   Achieved
frequency    Registers   LUTs     RAM/FIFO            frequency
50 MHz       749         1700     2          4        50 MHz
75 MHz       765         1769     2          4        75 MHz
100 MHz      926         1967     2          4        100 MHz
125 MHz      1042        1714     2          4        125 MHz
150 MHz      1546        2004     2          4        150 MHz
175 MHz      1380        1849     2          2        165 MHz
270 MHz      1457        1989     2          2        180 MHz
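The cycle counts and throughputs quoted in this section can be cross-checked with a few lines of host-side arithmetic. The sketch below is only a sanity check under assumptions: fft_factor is a hypothetical parameter that multiplies the FFT term of Equation (5) (1 for the minimum latency, 3 for the 17,408-cycle case given in the text), and the throughput is assumed to be the clock frequency divided by the cycle count, with one FFT processed at a time. The helper names are illustrative.

```c
/* Integer log2 for an exact power of two. */
static unsigned ilog2(unsigned n)
{
    unsigned r = 0;
    while (n > 1) {
        n >>= 1;
        r++;
    }
    return r;
}

/* Clock cycles for one FFT: loading + FFT passes + unloading.
   fft_factor = 1 reproduces Equation (5); larger values model the
   reduced task parallelism at higher target frequencies (assumption). */
static unsigned long fft_cycles(unsigned n, unsigned fft_factor)
{
    unsigned long fft_part = (unsigned long)fft_factor * (n / 2) * ilog2(n);
    return (unsigned long)n + fft_part + (unsigned long)n;
}

/* FFTs per second, assuming back-to-back, non-overlapping transforms.
   The result is truncated to an integer. */
static unsigned long ffts_per_second(unsigned long clock_hz,
                                     unsigned long cycles)
{
    return clock_hz / cycles;
}
```

Under these assumptions, fft_cycles(1024, 1) returns 7168, and ffts_per_second(75000000, 7168) returns 10463, matching the 75 MHz throughput quoted in the text. At 150 MHz the same cycle count gives 20926, the first row of Table 2.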
Table 2. FFT hardware usage for different throughputs.

                          Resource usage
Targeted     Slice       Slice    Block      DSP48Es
throughput   Registers   LUTs     RAM/FIFO
20926        1546        2004     2          4
12207        1351        1693     2          2
8616         1186        1418     2          2
6658         1161        1404     2          1

[Figure: Relative error for different bit sizes for the fractional part; vertical axis: relative error in dB.]

code. Results of the generated FFT for a Virtex-5 FPGA have been presented. The FFT has a broad range of applications in digital signal processing and multimedia. It is a key component that determines most of the design metrics in many signal processing and communication applications. HLS tools allow complex algorithms to be realized at a higher level. They can reduce the design cycle significantly while successfully generating results very close to a handmade HDL design.

7. Acknowledgements

The authors would like to thank Xilinx, Inc. (www.xilinx.com) and Synopsys (www.synopsys.com) for their valuable support.

REFERENCES
creases Productivity - A Case Study," IEEE International Symposium on VLSI Design, Automation and Test, Hsinchu, 28-30 April 2009, pp. 96-101. doi:10.1109/VDAT.2009.5158104
[15] Berkeley Design Technology, "An Independent Evaluation of High-Level Synthesis Tools for Xilinx FPGAs," http://www.bdti.com
[16] K. L. Man, "An Overview of SystemCFL," Research in Microelectronics and Electronics, 2005 PhD, Vol. 1, 2005, pp. 145-148.
[17] P. Schumacher, M. Mattavelli, A. Chirila-Rus and R. Turney, "A Software/Hardware Platform for Rapid Prototyping of Video and Multimedia Designs," Proceedings of Fifth International Workshop on System-on-Chip for Real-Time Applications, 20-24 July 2005, pp. 30-33. doi:10.1109/IWSOC.2005.27
[18] W. Chen (Ed.), "The VLSI Handbook," 2nd Edition, Chapter 86, CRC Press LLC, Boca Raton, 2007.
[19] S. Van Haastregt and B. Kienhuis, "Automated Synthesis of Streaming C Applications to Process Networks in Hardware," Proceedings of the Conference on Design Automation & Test in Europe, April 2009, pp. 890-893.
[20] P. Coussy and A. Morawiec, "High-Level Synthesis: From Algorithm to Digital Circuits," Springer Science + Business Media, Chapters 1, 4, Berlin, 2008.
[21] N. Hatami, A. Ghofrani, P. Prinetto and Z. Navabi, "TLM 2.0 Simple Sockets Synthesis to RTL," International Conference on Design & Technology of Integrated Systems in Nanoscale Era, Vol. 1, 2000, pp. 232-235.
[22] D. L. Jones, "FFT Reference C Code," University of Illinois at Urbana-Champaign, 1992.
[23] Xilinx Inc., "CoreGen," http://www.xilinx.com