ScholarWorks@GVSU
4-2015
ScholarWorks Citation
Zwagerman, Michael D., "High Level Synthesis, a Use Case Comparison with Hardware Description
Language" (2015). Masters Theses. 755.
https://scholarworks.gvsu.edu/theses/755
This Thesis is brought to you for free and open access by the Graduate Research and Creative Practice at
ScholarWorks@GVSU. It has been accepted for inclusion in Masters Theses by an authorized administrator of
ScholarWorks@GVSU. For more information, please contact [email protected].
High Level Synthesis, a Use Case Comparison with Hardware Description Language
Michael Zwagerman
In
School of Engineering
April 2015
Abstract
This paper compares Vivado High-Level Synthesis (HLS), a new mainstream technology offered
by Xilinx Inc., against the typical Hardware Description Language (HDL) design approach. An
example video filter application was implemented via both methods and compared for
differences in performance and Non-Recurring Engineering (NRE). Lessons learned using
HLS are also provided. The objective of this paper is to provide actual comparison data on the
current state of mainstream HLS to enable informed decision making for designs considering
HLS.
The Xilinx Zynq System on a Chip (SoC) offering is used as a platform for both the traditional
HDL methods and HLS. This platform includes Field Programmable Gate Array (FPGA) fabric
combined with a high speed application microprocessor. These single silicon SoC solutions
appear to be a platform capable of effectively utilizing HLS. The example video application
selected for implementation is a 9 by 9 kernel convolution filter performed on 24 bit 1080p video
at 60 frames per second. The 2013 Xilinx Vivado tool suite was used for both HLS and HDL
methods.
HLS proved to be very easy to use to create a functional RTL design. With naïve
implementations in both, HLS did not perform well in resource utilization. HLS also provided a
design with a slower maximum clock frequency.
Contents
Abstract
Introduction
Implementation
Analysis
    1. Resources
        Optimal Routing Utilization
        Congested Routing Utilization
        Summary
    2. Speed
        Summary
    3. Non-Recurring Engineering
        Summary
    Analysis Summary
Conclusion
Figures
Figure 1: Design Time vs. Application Performance with RTL Design Entry
Figure 2: Design Time vs. Application Performance with Vivado HLS Compiler
Figure 3: HLS Basic Flow [6]
Figure 4: Embedded Platform Algorithmic Life Cycle
Figure 5: Zynq-7000 EPP ZC702 Evaluation Kit
Figure 6: 9x9 Kernel Coefficients
Figure 7: Kernel Convolution
Figure 8: Example Raw (Pre Filter) Full Image
Figure 9: Example Image Post Kernel Filter Full Image
Figure 10: Example Raw (Pre Filter) Image Cropped to Show Detail
Figure 11: Example Image Post Kernel Filter Cropped to Show Detail
Figure 12: Vivado TRD PS Subsystem with Video Processing Highlighted
Figure 13: Vivado TRD Video Processing with Sobel Filter Highlighted
Figure 14: HLS Version Info
Figure 15: HLS Kernel Convolution Code Snippet
Figure 16: HLS C Synthesis Memory Usage
Figure 17: Example of HLS Generated Verilog Code
Figure 18: HDL Kernel Coefficient Code Snippet
Figure 19: HDL Kernel Convolution Code Snippet
Figure 20: HDL Kernel Product Sum Code Snippet
Figure 21: TRD Test Pattern Video (No Filter)
Figure 22: Test Pattern Video (Filtered)
Figure 23: Vivado 2013.3 Version Info
Figure 24: Empty Project Resource Utilization
Figure 25: Vivado 2013.4 Version Info
Figure 26: TRD Resource Utilization
Tables
Table 1: Empty Project Resource Utilization
Table 2: TRD Resource Utilization
Table 3: Empty Project Timing Closure
Table 4: NRE Totals
Acronyms and Abbreviations
Introduction
Research Background
Reactively compensating for growth in device capabilities and complexity, HLS offers a design
paradigm shift to a higher abstraction level which could become a more productive solution for
implementing algorithms in an FPGA[1]. HLS tools have been around for over 30 years, but
have not yet been adopted widely in industry[6]. In early 2011 Xilinx purchased AutoESL
Design Technologies which produced the AutoPilot High-Level Synthesis Tool[7] and later
rebranded the tool as Vivado HLS[8]. Xilinx Vivado HLS is a mainstream offering that may
finally supplant traditional HDL methods. It is worth noting that Vivado HLS is neither free nor
inexpensive: a license from Xilinx Inc. is sold for $1995 (node-locked) or $2395
(floating)[2].
Xilinx suggests that the design time it takes to implement a software application in HLS is much
less than implementing in HDL (RTL). They also indicate that the performance of the HLS
modules is worse, but close to their HDL counterparts. They provide graphs in UG998[9] to
visualize this. These graphs are included in Figure 1 and Figure 2.
Figure 2: Design Time vs. Application Performance with Vivado HLS Compiler
HLS tools parse the high level language source code and compile it to an internal representation
called a Control and Data Flow Graph (CDFG). The CDFG is optimized based on automatic or
manual algorithms for allocation (allocation of computing and storage resources), scheduling
(clocking and timing), and binding (mapping operations to allocated computational or storage
resources). After optimization is complete, RTL in the form of HDL is generated [6]. A
basic HLS tool flow is shown in Figure 3.
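To make the scheduling step more concrete, the following minimal sketch assigns each operation of a small data-flow graph to the earliest clock cycle at which all of its inputs are available (as-soon-as-possible scheduling). It illustrates the general idea only and does not represent how Vivado HLS is implemented; the graph, the operation names, and the uniform one-cycle latency are assumptions made for this example.

```python
def asap_schedule(deps, latency=1):
    """Assign each operation the earliest cycle at which all inputs are ready.

    deps maps an operation name to the list of operations it depends on.
    Every operation is assumed to take `latency` cycles (an assumption).
    """
    start = {}

    def visit(op):
        if op not in start:
            # An operation with no dependencies starts in cycle 0.
            start[op] = max((visit(d) + latency for d in deps[op]), default=0)
        return start[op]

    for op in deps:
        visit(op)
    return start


# Hypothetical data-flow graph for out = (a*b + c*d) >> 9
graph = {
    "mul0": [],
    "mul1": [],
    "add0": ["mul0", "mul1"],
    "shift": ["add0"],
}
schedule = asap_schedule(graph)
print(schedule)  # {'mul0': 0, 'mul1': 0, 'add0': 1, 'shift': 2}
```

The two multiplies have no mutual dependency, so they share cycle 0. Binding would then decide whether they map to two multipliers or must be serialized onto one.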
Problem Research
An example, or typical, FPGA solution was needed for this comparative research. A video
application was selected because video applications commonly utilize an FPGA to assist in
image processing due to their computationally intensive pixel based operations which can bog
down an embedded microprocessor. FPGAs are well suited for pixel operations that can be
accomplished in a streaming (or pixel pipelined) manner.
Algorithms developed for an embedded platform commonly go through an embedded platform
algorithmic life cycle, which is depicted in Figure 4.
Video applications are typically designed on software computation platforms utilizing a high
level language such as Matlab or C++. These computation platforms are typically higher end
computers that provide near real time visual feedback. After an application algorithm has been
designed, it often must be modified or tailored for the FPGA embedded production platform on
which it must be executed in the field. Porting software algorithms to an FPGA is not always
straightforward. Software algorithms are ultimately sequential instructions and are typically
single threaded operations. FPGAs do not efficiently lend themselves to sequential operation.
FPGAs consist of hardware resources which are ‘connected’ via a configuration. Utilizing an
FPGA for software-like sequential operation can require a significant quantity of resources to
emulate the sequential ordering. To minimize the required quantity of FPGA resources the
originally specified algorithm can be refactored with parallelization or hardware design in mind.
(Note: Typically an FPGA consisting of fewer resources is less expensive.) This type of
refactoring requires skilled FPGA designers and additional effort. Due to the ‘human factor’ in
the refactoring process, it is also possible that the results from the FPGA implementation diverge
from the original design intent. Additional testing and simulation must be performed to provide
confidence in bit-exactness between the implementation and algorithm.
The target hardware platform for the FPGA implementations is the Xilinx ZC702 evaluation
board[4] as pictured in Figure 5.
The ZC702 development board is designed to evaluate the Zynq XC7Z020 SoC component. This
platform was selected because Xilinx provides a base Targeted Reference Design (TRD)[5] to
help get started quickly evaluating the Zynq XC7Z020 SoC on this board. This TRD includes
example source and binaries for running Xilinx Petalinux with an example 1080p video demo
application. This video demo included an example filter project for generating a video processing
IP block.
Implementation
An image filter technique of convolving each input image pixel with a kernel was selected as the
example algorithm for implementation. This technique is used for causing a range of image
effects [3]. Kernel convolution creates a new ‘output pixel’ using coefficient weights applied to
an ‘input pixel’ and its neighbor pixels. An output image is created by performing kernel
convolution on every pixel in the input image. The 9x9 kernel used in this paper performs a
blurring filter. The 9x9 kernel size is larger than a more typical 3x3 kernel size used in other
video filters, but most video applications utilize multiple kernel operations, and a larger kernel
should amplify differences in implementation efficiency. With 1080p video, a 9x9 kernel performs
167,961,600 immediate multiplies per color channel per frame. The coefficients used to form the
9x9 kernel used are listed in Figure 6.
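The multiply count quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every pixel of the 1920 x 1080 frame, including border pixels, receives a full 9x9 window; the per-second figure is derived here from the 60 frames per second stated in the abstract and is not a number reported by the thesis.

```python
WIDTH, HEIGHT = 1920, 1080   # 1080p frame dimensions
KERNEL = 9                   # 9x9 kernel
FPS = 60                     # frame rate stated in the abstract

# One multiply per kernel coefficient for every pixel in the frame.
multiplies_per_frame = WIDTH * HEIGHT * KERNEL * KERNEL
multiplies_per_second = multiplies_per_frame * FPS

print(multiplies_per_frame)    # 167961600
print(multiplies_per_second)   # 10077696000
```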
The coefficients of the kernel are approximately Gaussian, but were increased to have more
averaging effect. Some pixels were also increased slightly to achieve a matrix coefficient sum of
512. A power of 2 sum is desired because the final intensity normalization is then simply a shift.
The 9x9 kernel convolution process is outlined in Figure 7. The first step is a Hadamard product
(an entry-wise product). The values in the resultant matrix are summed and then normalized to
produce a single output pixel.
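The three steps (entry-wise product, sum, and normalization by a shift) can be sketched as a small software reference model. The coefficients below are placeholders chosen only so that they sum to 512; they are not the coefficients of Figure 6, which are not reproduced here. The function name and the single-channel window are likewise assumptions for illustration.

```python
KERNEL_SIZE = 9
SHIFT = 9  # 2**9 == 512, the coefficient sum

# Placeholder coefficients (NOT those of Figure 6): 6 everywhere, with the
# center raised to 32 so that the 81 coefficients sum to 512.
kernel = [[6] * KERNEL_SIZE for _ in range(KERNEL_SIZE)]
kernel[4][4] = 32
assert sum(map(sum, kernel)) == 1 << SHIFT


def convolve_window(window, kernel):
    """Return one output pixel for a 9x9 window of one color channel."""
    total = 0
    for w_row, k_row in zip(window, kernel):
        for pixel, coeff in zip(w_row, k_row):
            total += pixel * coeff   # entry-wise (Hadamard) product, summed
    return total >> SHIFT            # normalize: divide by 512 with a shift


flat = [[200] * KERNEL_SIZE for _ in range(KERNEL_SIZE)]
print(convolve_window(flat, kernel))  # 200
```

Because the coefficients sum to a power of two, a uniformly colored window passes through unchanged, which is a convenient sanity check for any implementation.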
Figure 7: Kernel Convolution
Visual Filter Example
Our selected filter implements a blurring effect that is visually evident when comparing the
source and resultant images. Figure 8 through Figure 11 display the effect of the filter on an
image with high detail. Figure 8 and Figure 9 are scaled down 1080p images. The change from
Figure 8 to Figure 9 is subtle because the 9x9 kernel is small relative to a 1080p
image. Figure 10 and Figure 11 are 500 x 300 pixel cropped subsections. When comparing
Figure 10 to Figure 11 the blurring is pronounced.
Figure 10: Example Raw (Pre Filter) Image Cropped to Show Detail
Figure 11: Example Image Post Kernel Filter Cropped to Show Detail
Note: the original full version of Figure 8 was included in the Xilinx Zynq TRD compiled for the
ZC702 evaluation platform.
Xilinx TRD
The Xilinx TRD was used as the infrastructure backbone for efficiently utilizing the custom
kernel implementations. According to Xilinx the “TRD is an embedded video processing
application designed to showcase various features and capabilities of the Zynq Z-7020 AP SoC
device for the embedded domain”[5]. As a demo application, the TRD includes a multiplicity of
interworking components, most notably: a running OS, a complete ARM configuration, and a
custom logic video subsystem that utilizes a test pattern generator and a 1080p HDMI output.
The TRD provided a functioning system into which the custom logic evaluated in this paper was
inserted.
For Xilinx’s demo purposes the TRD included a custom logic Sobel filter that performed two
3x3 kernel convolutions. This Sobel filter was replaced with the custom 9x9 blurring filter.
The TRD video subsystem is not rudimentary. The video filter is only a single component of the
video subsystem. Figure 12 is the block diagram of the Vivado TRD Processing System (PS)
subsystem with the video processing block highlighted. This block diagram is only an abstracted
view of the internal modules. Figure 13 is the expanded block diagram of the video processing
block highlighted in Figure 12. The highlighted Sobel filter block is the component that was
replaced.
Figure 12: Vivado TRD PS Subsystem with Video Processing Highlighted
Figure 13: Vivado TRD Video Processing with Sobel Filter Highlighted
Using the TRD came with its share of compromises, which are listed here:
- Because the TRD is a full-fledged demo, synthesizing and implementing the TRD project
took approximately 1 hour on a 3.7 GHz hexa-core system.
- Due to a Microsoft Windows 7 path limitation, building is only possible if directories are
kept very short (shorter than approximately 7 characters from the drive root).
- Building the boot.bin file uses Petalinux, which must be installed on a computer running
Linux.
- The TRD video filter is assumed to have an Advanced eXtensible Interface (AXI) stream
interface.
- Only some SD cards were found to work (determined experimentally).
- The 2013.4 TRD was found unable to boot. (The 2013.3 TRD works fine.)
- Modifying the filter requires repackaging the HLS output into an Intellectual Property
(IP) block (increasing the revision number), copying the IP block over to the Vivado
project, and upgrading the IP in the Vivado project. (Additional note: the project needs to be
closed and reopened; otherwise the Vivado IP status report won’t detect the new IP.)
- If the design doesn’t meet timing requirements, the reported Vivado error confusingly
indicates that you don’t have a license for a free piece of IP.
Implementing in HLS
The HLS solution was generated with the Xilinx HLS tool suite version documented in Figure
14.
Implementing the kernel in HLS involved modifying the existing HLS Sobel project bundled
with the Xilinx TRD. This example project provided the AXI streaming interface needed to
integrate into the TRD. A code snippet containing the kernel convolution is included in Figure
15.
The operation code is straightforward. Given a window of 9x9 pixels:
1. Multiply each pixel in the window by the kernel coefficient.
2. Sum all the pixel and coefficient products.
3. Shift the final sum to reduce the change in intensity.
Special size-limited types (ap_uint) were used to reduce the number of bits required in FPGA
fabric. The ‘ap_’ data types were provided by Xilinx. Testing for correct operation was
simplified by the TRD included test bench which executed the C++ and generated an image with
the filter applied. Xilinx HLS can easily export the HLS solution to an RTL IP core with the
click of a button. This IP Core can be utilized by a Vivado project.
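The benefit of size-limited types can be illustrated by working out how wide the convolution accumulator needs to be. The sketch below assumes 8-bit color channels (24-bit video split into three channels) and non-negative coefficients that sum to 512, as described earlier. The resulting width is a derived worst-case bound, not the width used in the thesis code of Figure 15, which is not reproduced here.

```python
PIXEL_BITS = 8      # one color channel of 24-bit video (assumption: 8 bits each)
COEFF_SUM = 512     # kernel coefficient sum stated in the text

max_pixel = (1 << PIXEL_BITS) - 1     # largest channel value: 255
max_sum = max_pixel * COEFF_SUM       # worst case: every pixel in the window is 255
acc_bits = max_sum.bit_length()       # bits needed to hold the worst-case sum

print(max_sum, acc_bits)              # 130560 17
```

Under these assumptions a 17-bit unsigned accumulator is sufficient, compared with the 32 bits of a default C++ unsigned int, which is the kind of saving that arbitrary-width types make possible.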
6. The HLS generated HDL code is very obfuscated and difficult to understand and modify.
(See Figure 17.)
7. The HLS generated HDL code commonly uses inverted logic. (See Figure 17.)
8. While HLS will export VHDL source, the IP Core generated by HLS is Verilog only.
Figure 17: Example of HLS generated Verilog Code.
Implementing in HDL
The HDL implementation was coded in VHDL. The kernel convolution was inserted into an
HLS-generated pass-through (i.e. no filter) project. The pass-through filter project was used to generate
the input and output AXI stream interfaces used to connect into the Xilinx TRD. Using HLS
generated AXI Stream interfaces removed any resource or performance difference caused by
differing AXI implementations.
To operate within the TRD framework timing, the entire kernel convolution needed to happen in
a single ‘fast’ 150 MHz cycle. The first step in the kernel design was to buffer up 9 full rows of
pixels. FPGA internal block RAM resources were used to form pixel buffers. These pixel buffers
cannot be used directly for kernel convolution because block RAMs do not provide single cycle
random access to multiple addressable locations. Data in the pixel window used for convolution
would need to be cached locally from the pixel buffers for simultaneous access. This caching
was accomplished with pixel shift registers. The shift registers are fed via the line buffers and
shift data through on every clock cycle. The convolution is performed with each pixel in the shift
register on every cycle. Each pixel in the kernel window cache is multiplied by a kernel
coefficient. The VHDL kernel coefficients are shown in Figure 18. The kernel convolution is
shown in Figure 19.
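The line buffer and shift register arrangement can be modeled in software as follows. This is a behavioral sketch only, not the VHDL of Figures 18 and 19: the function name is hypothetical, border handling is not modeled, and each loop iteration stands in for one clock cycle. A 3x3 window on a 4-pixel-wide image is used in the usage lines so that the result is easy to inspect; the thesis design corresponds to k=9 and width=1920.

```python
from collections import deque


def sliding_windows(pixels, width, k=9):
    """Yield (row, col, window) for each fully populated k x k window.

    pixels is an iterable in raster order. The model uses k line buffers
    feeding a k x k shift-register window that advances one pixel per clock.
    """
    lines = [[0] * width for _ in range(k)]                  # line buffers (block RAM)
    window = [deque([0] * k, maxlen=k) for _ in range(k)]    # shift registers
    for i, p in enumerate(pixels):
        row, col = divmod(i, width)
        # Move the stored column up by one line and store the new pixel.
        for r in range(k - 1):
            lines[r][col] = lines[r + 1][col]
        lines[k - 1][col] = p
        # Shift one new column of k pixels into the window.
        for r in range(k):
            window[r].append(lines[r][col])
        # The window is valid once k rows and k columns have been seen.
        if row >= k - 1 and col >= k - 1:
            yield row, col, [list(w) for w in window]


wins = list(sliding_windows(range(16), width=4, k=3))
print(wins[0])  # (2, 2, [[0, 1, 2], [4, 5, 6], [8, 9, 10]])
```

Each clock reads only one location per line buffer, which respects the block RAM access restriction, while the shift registers expose all window pixels at once for the multiplies.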
Figure 19: HDL Kernel Convolution Code Snippet
Summing the elements of the resultant matrix in a single cycle would require a significant
number of resources. To reduce the number of required resources by this implementation, the
sum operation was pipelined. The pipeline consists of four stages, each of which sums groups of
three intermediates. The number of intermediates entering and leaving the stages is 81, 27, 9, 3,
then 1. The snippet of VHDL that performs the sum is included in Figure 20.
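The reduction from 81 products to a single sum can be sketched as a ternary adder tree. The model below is a software illustration under stated assumptions, not the VHDL of Figure 20: the function names are hypothetical, and the stages run one after another here, whereas in hardware each stage would be registered and all stages would operate concurrently on different pixels.

```python
def adder_tree_stage(values, fan_in=3):
    """Sum groups of `fan_in` adjacent values (one pipeline stage)."""
    return [sum(values[i:i + fan_in]) for i in range(0, len(values), fan_in)]


def pipelined_sum(products, fan_in=3):
    """Reduce the products to one sum, recording the width of each stage."""
    sizes = [len(products)]
    while len(products) > 1:
        products = adder_tree_stage(products, fan_in)
        sizes.append(len(products))
    return products[0], sizes


total, sizes = pipelined_sum(list(range(81)))
print(total, sizes)  # 3240 [81, 27, 9, 3, 1]
```

Because 81 is a power of three, four stages of three-input adders reduce the products exactly, with no leftover operands at any stage.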
Figure 20: HDL Kernel Product Sum Code Snippet
Working Implementation
Both implementations were brought up successfully on the ZC702 with the TRD and work on
1080p video without dropping frames. For each implementation, the TRD Sobel filter was
removed and replaced with the custom kernel filter. Figure 21 is a photograph of the TRD
generated test pattern without any filtering. Figure 22 is a photograph of the test pattern after
enabling the filtering. In Figure 22 the fine lines that were present in Figure 21 are filtered into a
solid color.
Analysis
The HLS solution and HDL solution were compared for differences in performance (resource
utilization, maximum theoretical frequency) and Non-Recurring Engineering (NRE) cost.
1. Resources
One of the more important metrics applied to a custom logic design is resource utilization.
FPGAs contain finite resources, and typically FPGAs with fewer resources cost less. Designs
with fewer resources use less power. This paper includes utilization for Look Up Tables (LUTs),
Flip Flops (FFs), Block RAMs (BRAMs) and Digital Signal Processing (DSP) FPGA primitives
(DSP48s).
Each RTL implementation was synthesized, placed, and routed in an empty project to get an idea
of the ideal resource utilization. The version of Vivado used is documented in Figure 23.
The device targeted was xc7z020clg484-1. The ‘Vivado Synthesis Defaults (Vivado Synthesis
2013)’ synthesis strategy was used. The ‘Performance_Explore (Vivado Implementation 2013)’
implementation strategy was used. The device clock was constrained to 150 MHz.
Table 1: Empty Project Resource Utilization
          HDL     HLS     Total Available
LUT       2989    4827    53200
FF        6139    5970    106400
BRAM      12      12      140
DSP48     0       0       220
Figure 24: Empty Project Resource Utilization
The HLS implementation used 61% (1838) more LUTs than the HDL implementation.
Differences in all other resources were negligible.
The resource utilization of the full TRD was also recorded. The TRD includes many other
custom logic modules; resource utilization results from the TRD better indicate real world (non-
ideal) results.
Synthesis and implementation of the Xilinx TRD was performed with Xilinx Vivado 2013.4.
Full version information is documented in Figure 25.
Figure 25: Vivado 2013.4 Version Info
The device targeted was the xc7z020clg484-1. The ‘Vivado Synthesis Defaults (Vivado
Synthesis 2013)’ synthesis strategy was used. The ‘Performance_Explore (Vivado
Implementation 2013)’ implementation strategy was used. The device clock was constrained to
150 MHz.
Table 2: TRD Resource Utilization
          HDL     HLS     Total Available
LUT       23767   25598   53200
FF        34160   34038   106400
BRAM      59.5    59.5    140
DSP48     23      23      220
Figure 26: TRD Resource Utilization
Summary
The HLS implementation used 1831 (approximately 7.7%) more LUTs than the HDL implementation
in the full TRD. Differences in all other resources were negligible.
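The LUT overheads follow directly from the utilization tables. The short calculation below recomputes them from the Table 1 and Table 2 values; the helper function is hypothetical and exists only for this illustration.

```python
def lut_overhead(hdl_luts, hls_luts):
    """Return (extra LUTs, percent increase relative to the HDL design)."""
    extra = hls_luts - hdl_luts
    return extra, 100.0 * extra / hdl_luts


empty = lut_overhead(2989, 4827)     # Table 1 (empty project)
trd = lut_overhead(23767, 25598)     # Table 2 (full TRD)

print(empty[0], round(empty[1], 1))  # 1838 61.5
print(trd[0], round(trd[1], 1))      # 1831 7.7
```

The absolute overhead is nearly identical in both builds (1838 versus 1831 LUTs). The relative overhead is much smaller in the TRD only because the surrounding TRD logic dominates the total.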
2. Speed
Maximum operational speed is a measure of how fast the implementation can run before the
design no longer functions. Specific designs have additional constraints including routing
congestion and keep out areas that cause project specific clock speed reductions, but maximum
operational speed provides an indication of whether a particular module design will have
problems operating at a specific speed.
Maximum operational speed was determined by decreasing the clock period constraint in the
‘Empty Project’ and checking implementation timing closure. Summary results from this
experiment are documented in Table 3.
Summary
The HDL implementation maximum frequency was marginally faster than the HLS
implementation’s. Both implementations closed timing at the required 150 MHz.
3. Non-recurring Engineering
Reducing the amount of engineering effort required to complete a custom logic design can be a
very important performance objective. Total Non-recurring Engineering (NRE) was recorded for
each of the implementations. NRE can be dependent on the skills and abilities of the engineer
along with proper training. To help understand the NRE listed in this section, some background
on the engineer performing the work is warranted. The primary author of this paper (Mike
Zwagerman) has 8 years of industry experience: 4.5 years of embedded software, and 3.5 years
of custom logic design with HDLs. Various tutorials on HLS were performed before attempting
the kernel design and were not included in the design NRE sum. NRE totals are available in
Table 4.
Table 4: NRE Totals
        NRE
HLS     15 hrs
HDL     33 hrs
Note: 16 hours of additional effort (not included in the current total) were required to update the
HLS pass through project to accept the HDL design. See an example of HLS generated code in
Figure 17.
Summary
Implementing the HDL design took more than double the effort of the HLS design.
Analysis Summary
The HLS design was implemented in less than half of the time, but required 61% more LUTs in the
empty project and did not perform as fast in operational maximum frequency tests.
Conclusion
High Level Synthesis is a design process that is anticipated to replace the antiquated and time
consuming approach of designing digital logic in HDLs. Instead of coding RTL with
cumbersome HDLs, the HLS process parses high level software languages and generates
synthesizable RTL. Designing in high level software languages provides a mechanism for rapid
development and easy modification. Xilinx Vivado recently offered a new HLS tool (Vivado
HLS) to mainstream audiences. HLS claims to provide similar implementation performance to
traditional methods. It can be difficult to determine the suitability of HLS vs HDLs without
benchmarking. This paper provides a use case analysis of an example digital logic algorithm.
Resultant performance and NRE data from this use case analysis can be used as benchmark data
for deciding between HLS and HDL design flows.
This paper documents the implementation of an example digital logic algorithm in both HLS and
HDL process flows. The HLS implementation took significantly less time to produce a
functional module, but the HLS implementation was slower and used significantly more
hardware resources. This paper also includes a number of lessons learned regarding using the
HLS design process flow.
Future Work
This paper describes an implementation of an example algorithm; future work could implement a
different or many different algorithms to determine if the results in this paper were statistically
significant. The FPGA implementation used for this paper was naïve, un-optimized HLS and
HDL; future work could optimize the implementations to determine if the results remain
consistent across optimization level.
Works Cited
1. Coussy, P., D. D. Gajski, M. Meredith, and A. Takach. "An Introduction to High-Level
Synthesis." Design & Test of Computers, IEEE 26.4 (2009): 8-17. Web. 29 Jan. 2015.
4. "ZYNQ-7000 EPP ZC702 EVALUATION KIT." Xilinx Inc., 2012. Web. 2 Dec. 2014.
http://www.xilinx.com/publications/prod_mktg/zynq-7000-kit-product-brief.pdf
5. "Zynq Base TRD 2013.3." Xilinx Inc., n.d. Web. 2 Dec. 2014.
http://www.wiki.xilinx.com/Zynq+Base+TRD+2013.3
6. Haoxing, Ren. "A Brief Introduction on Contemporary High-Level Synthesis." IC Design &
Technology (ICICDT), 2014 IEEE International Conference on 28-30 May 2014: 1-4. Print.
7. "Xilinx buys high-level synthesis EDA vendor." EE Times, 31 Jan. 2011. Web. 2 Dec. 2014.
http://www.eetimes.com/document.asp?doc_id=1258504
8. "XCN12014 - Product Change Notice for AutoESL." Xilinx Inc., 6 Aug. 2012. Web. 2 Dec.
2014. http://www.xilinx.com/support/documentation/customer_notices/xcn12014.pdf
9. "UG998 - Introduction to FPGA Design with Vivado High-Level Synthesis." Xilinx Inc., 2 July
2013. Web. 2 Dec. 2014.