High-Level Synthesis Based VLSI Architectures For Video Coding
Original Citation:
Ahmad, Waqar (2017). High-Level Synthesis Based VLSI Architectures for Video Coding. PhD thesis, Politecnico di Torino. DOI: 10.6092/polito/porto/2665803. Available at http://porto.polito.it/2665803/ since February 2017.
By
Waqar Ahmad
Supervisor(s):
Prof. Guido Masera, Supervisor
Prof. Maurizio Martina, Co-Supervisor
Politecnico di Torino
2017
Declaration
I hereby declare that the contents and organization of this dissertation constitute my
own original work and do not compromise in any way the rights of third parties,
including those relating to the security of personal data.
Waqar Ahmad
2017
I wish to express my sincere thanks to Prof. Guido Masera for his support and
advice, his visionary project ideas, and for providing such a great research environment.
I would like to express my deep gratitude to Prof. Maurizio Martina, who served
as PhD co-supervisor. His constant support, his enthusiastic encouragement, his
exciting research ideas and his constructive criticism have been an invaluable help
for the success of this thesis. I am also particularly grateful for the technical and
non-technical help given by both Prof. Guido Masera and Prof. Maurizio Martina.
Furthermore, I want to thank all my colleagues from the VLSI Lab at Politecnico di
Torino for an excellent work environment and a great time, and the support and
administrative team at the VLSI Lab, who do an excellent job so that PhD students
can focus on their research work. Finally, I wish to thank my wife Sania for her
non-technical contributions to this work and for letting me follow my passion, my
lovely son Mujtaba, my parents for their constant support in what I do, and my
family and friends for reminding me of life outside my office.
Abstract
List of Figures
Nomenclature
1 Introduction
1.1 Introduction to Video Coding
1.2 High Level Synthesis Based Video Coding
1.3 Problem Statement
1.4 Contribution
1.5 Organization of the Thesis
4 High-Level Synthesis
4.1 What is High-Level Synthesis?
4.2 Overview of High-Level Synthesis Tools
4.2.1 Academic HLS Tools
4.2.2 Other HLS Tools
4.3 HLS Optimizations
4.3.1 Operation Chaining
4.3.2 Bitwidth Optimization
4.3.3 Memory Space Allocation
References
List of Figures
5.1 Pixel positions for integer, luma half and luma quarter pixels.
5.2 13x13 pixel grid for H.264/AVC luma interpolation of an 8x8 block (green represents the integer-pixel block to be interpolated; yellow represents the integer pixels padded around the block to support interpolation).
5.3 HLS implementation of H.264/AVC luma sub-pixel interpolation.
5.4 15x15 pixel grid for HEVC luma interpolation of an 8x8 block (green represents the integer-pixel block to be interpolated; yellow represents the integer pixels padded around the block to support interpolation).
5.5 HLS implementation of HEVC luma sub-pixel interpolation.
5.6 Chroma sample grid for eighth-sample interpolation.
5.7 7x7 pixel grid for HEVC chroma interpolation of a 4x4 block (green represents the integer-pixel block to be interpolated; yellow represents the integer pixels padded around the block to support interpolation).
5.8 HLS implementation of HEVC chroma sub-pixel interpolation.
5.9 Design time comparison: HLS vs. manual RTL design.
Acronyms / Abbreviations
FF Flip-Flop
HD High Definition
MB Macro Block
SD Standard Definition
TV Television
VoD Video-on-Demand
Chapter 1
Introduction
This chapter starts with an introduction to the fundamentals of video coding through
an historical perspective. Following this, the chapter surveys High-Level Synthesis
(HLS) based video coding. Subsequently, we propose an alternative methodology for
VLSI implementation of video coding algorithms and introduce its main components,
i.e., the HLS based simulation, verification, optimization and synthesis. We conclude
with an overview of the individual chapters, indicating the relevant contributions.
The process of compressing and decompressing video is called video coding or video
compression. Video compression algorithms digitally compress moving digital images.
Video coding has a long list of applications, including TV, phones, laptops and
cameras: wherever there is digital video content, there is video compression behind
it. Digital video in its original, uncompressed form requires a large amount of
storage capacity. As an example, uncompressed 1080p high definition (HD) video at
24 frames/second has a bit-rate of about 1.2 Gbit/s and requires 806 GB of storage
for a video of 1.5 hours duration. For this reason, video compression is a must for
the storage and transmission of digital video; otherwise it would be impossible to
store and process uncompressed video content for today's applications. Decompression
of the compressed video is then required for
displaying the video contents to the consumers.
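The storage figure follows directly from the raw video parameters; as a quick check, assuming 8 bits per colour component and three components per pixel (24 bits/pixel):

```latex
\underbrace{1920 \times 1080}_{\text{pixels/frame}} \times 24\,\tfrac{\text{bits}}{\text{pixel}} \times 24\,\tfrac{\text{frames}}{\text{s}} \approx 1.19\ \text{Gbit/s},
\qquad
\frac{1.19\ \text{Gbit/s} \times 5400\ \text{s}}{8\ \text{bits/byte}} \approx 806\ \text{GB}.
```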
Sending visual images to a remote location has captured the human imagination
for more than a century (Figure 1.1). The invention of television in 1926 by the
Scotsman John Logie Baird [1] led to the realisation of this concept over analogue
communication channels. Even analogue TV systems made use of compression or
information reduction to fit higher resolution visual images into limited transmission
bandwidths [2].
The emergence of mass market digital video in the 1990s was made possible
by compression techniques that had been developed during the preceding decades.
Even though the earliest videophones [3] and consumer digital video formats were
limited to very low resolution images (352x288 pixels or smaller), the amount
of information required to store and transmit moving video was too great for the
available transmission channels and storage media. Video coding or compression
was an integral part of these early digital applications and it has remained central to
each further development in video technology since 1990 [4].
By the early 1990s, many of the key concepts required for efficient video com-
pression had been developed. During the 1970s, industry experts recognised that
video compression had the potential to revolutionise the television industry. Efficient
compression would make it possible to transmit many more digital channels in the
bandwidth occupied by the older analogue TV channels.
Present-day video coding standards [5][6] and products share the following
features:
becomes interesting and has a two-fold advantage: the hardware implementation in
the target device can easily be replaced, and it can be refined at a higher abstraction level.
Nowadays, heterogeneous systems are being adopted as energy-efficient,
high-performance and high-throughput systems, mainly because further clock
frequency scaling is no longer possible. These systems consist of two main parts,
i.e. application-specific integrated circuits (ASICs) [15] and a software
processor [16], with each part dedicated to a specific task. As systems grow in
complexity, their design becomes very complex as well. ASICs are the dedicated
hardware components that accelerate the computationally complex parts of the
system, and for the same reason the design of this dedicated hardware also becomes
complex and time-consuming. Hardware description languages (HDLs) [17] are used
for the register transfer level (RTL) [18] implementation of these components:
the cycle-by-cycle activity of the RTL implementation must be specified, which is
a low abstraction level. Such a low-level implementation requires advanced
hardware design expertise and is hard to manage during development. These
low-level implementations of complex systems increase the time-to-market by
taking more design and development time.
The combination of high-level synthesis (HLS) and FPGAs is an intriguing solution
to the problem of long time-to-market and to the realization of these heterogeneous
systems [19]. FPGAs are used for the configurable implementation of digital
integrated circuits, where manufacturing cost is an important factor. The use of
FPGAs as reconfigurable hardware enables fast implementation and optimization by
providing the ability to reconfigure the integrated circuit, hence removing the
extra manufacturing cost. It allows the designer to re-implement modifications
made to the design by changing the HDL description and re-synthesizing and
implementing the design on the same FPGA fabric with the help of the
implementation tools. Thus, HLS-based FPGA implementation of digital systems can
be helpful for functional verification, for evaluating possible hardware
implementations, and for large design-space exploration. An FPGA-based
implementation of a user application can also serve as an intermediate
implementation before the final ASIC or SoC implementation.
C, SystemC and C++ are the high-level languages (HLLs) used for software
programming and development. HLS tools take HLL code as input and generate an HDL
description as output.

lower resource usage than what other works could offer. Its reconfigurability also
provides better adaptability to many video coding interpolation algorithms.
In recent years, FPGA development has moved towards higher abstraction levels.
This move not only helps improve productivity, but also lowers the barrier for
more algorithm designers to access the attractive FPGA platform. A wide selection
of tools for high-level synthesis is available on the market. Algorithm designers
conventionally prefer high-level languages such as C/C++ for algorithm
development, and Vivado HLS is one of the tools capable of synthesizing C/C++
code into RTL for hardware implementation. Nevertheless, most high-level
synthesis tools cannot translate a high-level implementation into an RTL
implementation directly, and users must restructure the high-level code to make
it synthesizable and suitable for the specific hardware architecture. Therefore,
it becomes important to adapt to the high-level synthesis tool and to discover
approaches for achieving an efficient design with high performance and low
resource usage. The high-level synthesis tool used in this work is Vivado HLS
from Xilinx.
This thesis addresses the following issues:
1. How can engineers with limited FPGA experience quickly prototype an FPGA-based
SoC design for a high-performance video processing system?
2. How productive is Vivado HLS? What changes need to be made for a software
implementation to be synthesized into a hardware implementation?
3. How do the performance and area of video processing algorithms modelled with
Vivado HLS compare to those of RTL models from related works?
1.4 Contribution
This thesis presents an FPGA-based rapid prototyping flow for video processing
systems that aims to lower the barrier between software and hardware development.
The rapid prototyping flow consists of two major parts: 1) the video processing
system architecture design, and 2) the video processing algorithm design. By
understanding the underlying architecture of Xilinx's Zynq platform, a video
processing system can be quickly assembled at the block level with minimal RTL
modifications, reducing the development period from months to weeks. In addition,
since Vivado HLS does not provide a common structure for domain-specific
algorithm designs, this thesis proposes an HLS-based hardware architecture for
the interpolation filters of video coding algorithms in Vivado HLS. Several
optimizations are also applied to the proposed interpolation filter architecture,
so that it not only improves the video processing rate, but also reduces
flip-flop utilization and saves LUT utilization compared with similar works in
the literature. This work demonstrates that a computation-intensive video
processing system can be rapidly prototyped with real-time processing
performance to spare.
This thesis is organized as follows. Chapter 2 discusses and compares the
state-of-the-art video coding standards, i.e. High Efficiency Video Coding (HEVC),
H.264/AVC and the standardized extensions of HEVC; it also includes several
related works as well as background information on the coding tools that have
been added in 3D-HEVC. Chapter 3 describes the coding complexity analysis of
3D-HEVC: it identifies the computationally intensive tools of the 3D-HEVC encoder
and decoder, and discusses the class-wise coding and decoding time distribution
of the different classes (tools). Chapter 4 presents high-level synthesis, the
available high-level synthesis tools and the Xilinx Vivado Design Suite.
Chapter 5 describes the HLS-based implementation of the interpolation filters of
HEVC and H.264/AVC, one of the most computationally intensive parts of the video
coding algorithms, and includes a performance and resource utilization comparison
between this work and other works.
Last but not least, Chapter 6 concludes the thesis with a discussion of the
contributions, challenges and future work.
Chapter 2
State-of-the-art Video Coding Standards
An enabling technology for digital television systems worldwide was the MPEG-2
video coding standard [25], an extension of MPEG-1. MPEG-2 was widely used for
the transmission of High Definition (HD) and Standard Definition (SD) TV signals
over a variety of transmission media, such as terrestrial emission, cable and
satellite, and for storage onto DVDs.
The popular growth of HDTV and its services increased the need for higher coding
efficiency. Enhanced coding efficiency allows the transmission of higher quality
video and a larger number of video channels over the already available digital
transmission infrastructures, e.g. UMTS, xDSL and cable modem, which offer lower
data rates than the broadcast channels.
The evolution of video coding in telecommunication applications includes the
development of the H.261 [29], H.262 [25][26], H.263 [27], H.264/AVC [24][6]
and H.265 (HEVC) [5] video coding standards; prominent telecommunication
applications are wireless mobile networks, ISDN, LAN and T1/E1. To maximize the
coding efficiency, significant efforts have been made on loss/error optimization,
network types and the formatting of the video characteristics. This evolution of
the video coding standards expanded capabilities such as video shaping and
broadened the application areas of digital video.
The scope of a video coding standard is shown in Fig. 2.1. The transport and
storage media for the video signal are outside the scope of the standard. In all
ISO/IEC and ITU-T standards, the standardization centres on the decoder: the
syntax, the bitstream structure and the procedure for decoding the syntax
elements. This ensures that all decoders produce the same output when given an
encoded input bitstream that conforms to the constraints of the standard. This
limitation of the standard's scope allows flexibility and freedom for optimized
implementations with respect to, e.g., time-to-market, compression quality and
implementation cost. However, there is no guarantee of reproduction quality, as
even a crude coding technology can conform to the standard.
The prominent application areas of H.264/AVC, for which its technical solutions
were designed, include the following:
More flexible selection of picture ordering for display and referencing purposes.
For improvement in coding efficiency, the following parts of the standard were
also enhanced:
For more robust and flexible operation, the new design features included in the
H.264/AVC standard are as follows:
The parameter set structure is provided for efficient and robust conveyance of
header information.
Every syntax structure is placed in a logical data packet called a NAL (Network
Abstraction Layer) unit. This structure provides more flexibility in customizing
the transmission of the video content over specific networks.
Data partitioning is supported, which allows the slice syntax to be partitioned
into up to three parts for transmission purposes, depending on the categorization
of the syntax elements.
HEVC has been designed to accomplish several goals, including coding efficiency,
integration with transport systems, resilience to data losses and
implementability on parallel processing architectures.
The main features of the HEVC design are described briefly in the following
paragraphs.
The video coding layer uses the hybrid approach of intra-prediction,
inter-prediction and 2-D transform coding, the same approach used in all previous
video coding standards since H.261. The block diagram of a hybrid video encoder
that can produce an HEVC-conformant bitstream is shown in Fig. 2.3.
Fig. 2.3 HEVC video encoder (Light gray elements show decoder).
The highlighted features of HEVC are given in the following text; a more detailed
description of these properties can be found in [9].
Coding tree blocks (CTBs) and coding tree units (CTUs): one luma CTB and the
related chroma CTBs comprise a CTU. The size of the luma CTB can be LxL with
L = 16, 32, or 64 pixels; the larger the size, the better the compression. CTBs
are partitioned into smaller blocks using a quadtree-like structure and
signalling [30].
Coding blocks (CBs) and coding units (CUs): one luma CB, the two corresponding
chroma CBs and the related syntax comprise a coding unit (CU). CUs are
partitioned into prediction units (PUs) and a tree of transform units (TUs).
Prediction blocks (PBs) and prediction units (PUs): depending on the
prediction-type decision, luma and chroma CBs can be further partitioned in size
and predicted from luma and chroma PBs. HEVC supports variable PB sizes from
64x64 down to 4x4 samples.
Transform blocks (TBs) and transform units (TUs): transform blocks are used for
coding the prediction residual. The supported TB sizes are 4x4, 8x8, 16x16 and
32x32.
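To make the quadtree partitioning concrete, the sketch below shows the kind of recursive CTB split decision an encoder performs; the cost function is a toy placeholder and the 8x8 minimum CB size is an assumption, so this illustrates the structure rather than the HEVC reference model.

```cpp
#include <cstdio>
#include <cmath>

// Toy rate-distortion cost of coding the (size x size) block at (x, y)
// as a single CB -- a placeholder for real mode decision and residual cost.
static double rd_cost(int x, int y, int size) {
    return size * size * (1.0 + 0.1 * std::sin(x + y));  // illustrative only
}

// Recursively decide whether to code a block as one CB or split it into
// four quadrants, mirroring HEVC's CTB quadtree (8x8 minimum CB assumed).
static double partition(int x, int y, int size) {
    double whole = rd_cost(x, y, size);
    if (size <= 8) return whole;
    int h = size / 2;
    double split = partition(x, y, h) + partition(x + h, y, h)
                 + partition(x, y + h, h) + partition(x + h, y + h, h);
    return split < whole ? split : whole;
}

int main() {
    printf("best cost for a 64x64 CTB: %.1f\n", partition(0, 0, 64));
}
```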
New features were added in HEVC to modify the slice data structure and to enhance
parallel processing for packetization purposes; in the context of a particular
application, these features have specific benefits. HEVC has three categories of
extensions:
1. Range extensions
2. Scalability extensions
3. 3D video extensions
Enhanced chroma sampling structures of 4:2:2 and 4:4:4 and pixel bit depths of
more than 10 bits are supported in the range extensions of HEVC. The range
extensions are applicable to screen content coding, direct coding of RGB source
content, coding of auxiliary pictures, and lossless and high bit-rate coding.
The draft range extensions can be found in [31].
Coarse-grain SNR and spatial scalability are possible through the scalability
extensions of HEVC, also termed "SHVC"; [32] provides the draft text of the
scalability extensions. SNR and spatial scalability in SHVC can be combined with
the already available temporal scalability [33][34]. When spatial scalability is
used, the decoded reference layer picture is resampled with an upsampling filter
defined specifically for the spatial scalability scenario.
Depth in a visual scene can be perceived through the multiview and 3D video
formats in combination with a proper 3D display system. Two types of 3D display
are available in the market: stereoscopic displays, which require glasses to
present a different view to each eye, and auto-stereoscopic displays, which do
not require glasses.
Multiview HEVC
The prediction structure of multiview coding for the 3-view case is shown in
Fig. 2.4. HEVC is capable of flexible reference picture management, and this
capability enables inter-view sample prediction.
The correlation between views is exploited through residual and motion data;
block-level changes make the exploitation of this correlation possible, as shown
in Fig. 2.5.
3D-HEVC is the HEVC extension whose working draft and reference test model are
specified in [38], [39]. Advanced coding tools for multiple views are included
in this extension, for which [40] is the basis. Prominent 3D-HEVC tools are
presented in the following paragraphs.
Neighbouring block based disparity vector (NBDV) is the 3D-HEVC tool used to
identify similar blocks in different views. Its design is very similar to the
merge mode and AMVP in HEVC. NBDV is used for inter-view pixel prediction from
spatial and temporal neighbouring blocks, making use of already available
disparity vectors [41].
Fig. 2.6 shows the spatial neighbouring blocks used by the NBDV process; these
are the same blocks as in the merge mode/AMVP of HEVC, and they are accessed in
the same order as in merge mode: A1, B1, B0, A0, and B2.
Inter-view motion prediction is realized by modifying the merge mode with
additional candidates; no modification is made to AMVP. The new merge list has
six candidates, and the construction of the list is otherwise the same as in
HEVC. Up to two additional candidates can be put into the list, as described in
the following text.
NBDV provides the reference picture index and the motion vector of the block
found, as shown in Fig. 2.5; this is the first candidate inserted into the merge
list. NBDV also provides the disparity vector and the index of the inter-view
reference picture; this is the second candidate inserted into the merge list.
The insertion of the disparity vector into the candidate list does not depend on
the existence of the inter-view candidate [42].
Fig. 2.6 Spatial neighbouring blocks for NBDV (B2, B1 and B0 above, and A1 and A0 to the left of, the current block).
Fig. 2.7 shows that the TMVP co-located block in view 1 at time 1 has, for the
current block, reference index 0 and a disparity vector pointing according to the
current picture's temporal reference. For this reason the TMVP candidate would
usually be regarded as unavailable; it is made available by changing the
reference target index to 2, i.e. to the inter-view reference picture.
In the current non-base view, i.e. for the block DC, motion compensation is
carried out using the motion vector VD, as shown in Fig. 2.8. The NBDV vector
identifies the inter-view block BC. Then, using VD, motion compensation is
performed between the base-view reconstructed blocks Br and BC. This predicted
signal is added to the signal predicted by motion compensation of Dr. The
precision of the current block's residual prediction is best when the same
vector VD is used. With ARP enabled, this residual prediction can be weighted
by 1 or 0.5.
Calibration of the cameras with respect to lighting effects and colour transfer
is very important; otherwise, prediction between cameras recording the same scene
may fail. To improve the coding efficiency of blocks predicted from inter-view
pictures, a new coding tool named illumination compensation was developed [44].
The disparity vector of the current PU is used to identify the neighbouring
samples of the reference view, as shown in Fig. 2.9.
Depth-specific coding tools for efficient representation of the depth information
are added in the 3D-HEVC design. These tools allow non-rectangular partitioning
of the depth blocks. Depth coding modes such as depth modelling modes (DMM) [45],
simplified depth coding (SDC) [46] and region boundary chain coding (RBC) [47]
are used for partition-based depth intra coding. Fig. 2.10 shows the division of
a depth PU into one or two parts, where each part is represented by a DC value.
[Fig. 2.10: example division of an 8x8 depth PU into two parts, with each sample labelled P0 or P1.]
Two types of depth partitioning are available in DMM: wedge-shaped and contour
patterns. As shown in Fig. 2.10(a), in the wedge-shaped pattern the depth PU is
segmented by a straight line. In RBC, chains connected in series are used to
segment the depth PU, as shown in Fig. 2.10(b). Fig. 2.11 shows the partitioning
of a depth PU based on a contour pattern; as shown, these are irregular
partitions with separate sub-regions.
The motion parameters of the texture block can be reused for the depth block.
The merge list of the current depth block is modified by the addition of one more
candidate, enabling the inheritance of the motion parameters of the texture block
by the corresponding depth block. The co-located texture block is used to
generate this extra candidate [48].
The VSP approach is used to reduce the inter-view redundancy. In this approach,
the depth data is used to warp the texture view, thereby generating a predictor
for the current view [49].
Chapter 3
Coding Complexity Analysis of 3D-HEVC
While some papers in the literature address the complexity evaluation of some
tools of the 3D-HEVC encoder/decoder, no results are currently available that
specifically explore the complexity and hardware implementation of the renderer
model of 3D-HEVC used for View Synthesis Optimization (VSO). [51] presents time
profiling of the 3D-HTM 10.2 reference software, giving the complexity of texture
coding and of the Depth Modelling Modes (DMMs) used for depth map encoding. The
inter-prediction encoding time percentage for the 3D-HTM 8.0 reference software
is given in [52], but no information is presented regarding the complexity of
the rendering distortion estimation model of 3D-HEVC.
The renderer model is used in the RDO of depth map coding to estimate the
distortion of synthesized views. Depth maps, which lie at the core of 3D video
technologies, are used for virtual view synthesis: distortion in depth map coding
affects the quality of the intermediate virtual views generated during the DIBR
process. Because of these important observations, and based on the profiling
results, in Chapter 6 we focus on the renderer model. Identifying computational
hotspots helps both to decrease complexity and to increase performance, by
developing efficient tools and by implementing accelerated software and hardware
solutions for real-time visualization of the 3D video coding standard.
[Figure: basic 3D-HEVC codec structure: an HEVC texture coder for the base view and texture/depth coders for the dependent views, whose outputs are multiplexed into a single binary bitstream.]
Disparity-Compensated Prediction
Disparity-Compensated Prediction (DCP) is used for the inter-view prediction of
dependent views. The incorporation of DCP affects only the reference list
construction procedure: already coded pictures of other views in the same access
unit are added to the reference picture lists.
Inter-view motion prediction is used to eliminate the data redundancy between the
multiple views; a detailed description is given in [53]. The motion information
of the current block of a dependent view is obtained from the corresponding block
in the reference view.
[Figure: 3D-HEVC dependent texture coding tools: Disparity Compensated Prediction (DCP), Depth-based Inter-view Motion Parameter Prediction (DBIvMPP) and Inter-view Residual Prediction (IvRP).]
Advanced Residual Prediction (ARP) is described in detail in [54]. Correlation
also exists between the residual of an already coded view and the residual of the
current view; advanced inter-view prediction is used to exploit this correlation.
Depth maps represent the distance of the objects in the scene from the camera.
They are used for the synthesis of intermediate views in multi-view generation
systems. Depth maps consist of constant-value regions with sharp edges, so
additional coding tools are used for depth map intra-prediction.
In [55], depth modelling modes are introduced for intra coding of depth maps.
These tools divide the depth map into two non-rectangular regions of constant
value for intra coding.
3D-HEVC is based on the video-plus-depth format. Depth maps facilitate the
synthesis of intermediate views at the decoder side for applications like 3D-TV
and free-viewpoint TV. Compression errors in the depth maps result in synthesis
artefacts in the intermediate views rendered through Depth Image Based Rendering
(DIBR) methods. To remove these coding artefacts from the virtual view synthesis
process, the Synthesized View Distortion Computation (SVDC) models are included
in 3D-HEVC. The encoding and decoding time complexity analysis of the 3D-HEVC
standard is presented in this section. Profiling of the 3D-HEVC reference
software is carried out with gprof and the gcc compiler on the standard video
sequences specified in the Common Test Conditions (CTC). The profiling results
show that 18-26% of the total encoding time is consumed by the renderer model,
alongside the other compute-intensive parts, i.e. Motion Estimation (ME) and
interpolation filtering.
Partial profiling results for 3D-HEVC texture and depth maps are presented in
[51] and [52]; however, we cannot directly compare our profiling results with
theirs, because our results are more detailed, down to the class/function level.
Table 3.1 compares the encoder profiling results of the 3D-HEVC and HEVC [59]
standards for the Random Access (RA) and All Intra (AI) configurations,
respectively. As shown in Table 3.1, the TComRdCost class consumes the majority
of the encoding time, i.e. about 31.3-35.4% and 9.8-38.8% in the two
configurations of 3D-HEVC and HEVC, respectively. Motion Estimation (ME),
inter-view residual prediction, inter-view motion prediction and other distortion
operations take place in the TComRdCost class, where operations like the Sum of
Absolute Differences (SAD), the Hadamard transform (HAD) and the Sum of Squared
Errors (SSE) are performed for rate-distortion computation. The depth map
estimation used in inter-view motion prediction for the derivation of the
disparity vectors of dependent views is also calculated in this class. The
TRenSingleModelC class consumes about 26.8% and 18.3% of the time; processes like
VSO and SVDC estimation take place in this class, and the rendering process is
used for virtual view synthesis. In the Random Access (RA) configuration, the
time taken by the TComInterpolationFilter class, where the vertical and
horizontal filtering (VHF) of motion compensation occurs, is about 19.3% and
19.8%, respectively. Interpolation filtering is used whenever inter-view residual
prediction, de-blocking or view synthesis prediction is applied. In 3D-HEVC and
HEVC, the TComTrQuant class accounts for about 9% and 10%, and 24.4% and 10.7%,
of the total encoding time, respectively. As the name of the class suggests, the
process of Rate-Distortion Optimized Quantization (RDOQ), i.e. rate- and
distortion-optimized transform and quantization, takes place in TComTrQuant.
TEncSearch accounts for about 8.7% and 3.6%, and 11.8% and 7.4%, of the time in
the two configurations of the HEVC-based encoders, respectively. In TEncSearch,
the encoder searches for the cost and the rate-distortion of the inter, intra and
depth-intra (DMM) modes, for the motion estimation processes and for the Advanced
Motion Vector Prediction (AMVP) of the HEVC-based standards. Similarly,
intra-prediction classes like TComPrediction and TComPattern contribute about 2%
to 7% of the total encoding time in both configurations of 3D-HEVC. The actual
optimized transform takes place in the partialButterfly* functions, which
contribute about 2.3% and 4.1% in both configurations of the standards. Other
classes like TEncSbac, TComDataCU and TComYuv contribute about 1%-3% of the
total encoding time in both configurations.
Table 3.2 shows the decoding time distribution of the 3D-HEVC and HEVC decoders;
the classes contributing significantly to the time consumed in the decoding
process are shown. In the All Intra configuration, more than a quarter of the
total time is spent in TComInterpolationFilter, as interpolation filtering is
used in the motion compensation process. The TComCUMvField, TComLoopFilter and
TComDataCU classes also account for much of the decoding time; in these classes
the processes internal to the CU, advanced motion vector prediction and filtering
take place. TComYuv is a general YUV buffer class that manages the memory-related
functionality of the decoder. In the Random Access configuration, the
partialButterflyInverse, TComPattern, TDecCu and TComLoopFilter classes are the
computationally intensive classes in the decoding process; they implement the
inverse transform, the coding unit functions, intra prediction and loop
filtering. As observed from the profiling results, the computational complexity
of the classes varies between HEVC and 3D-HEVC.
Fig. 3.3 shows the identification and mapping of the computationally intensive
parts of the 3D-HTM reference implementation. These parts are identified by
mapping the profiling results of the C++ HTM encoder and decoder classes onto the
high-level coding tools of the 3D-HEVC encoder. The profiling results show that
the major part of the 3D-HEVC encoding time is consumed in motion estimation
(including inter-view motion prediction), in the encoder control regions
consisting of VSO and SVDC through the rendering method, and in the interpolation
filters, as shown in Fig. 3.3. The identified computationally intensive parts of
the 3D-HEVC standard are listed as follows:
1. Motion Estimation (including inter-view motion prediction)
2. View Synthesis Optimization (VSO) and SVDC through the rendering model
3. Interpolation Filters
[Fig. 3.3: mapping of the profiled HTM classes (TComRdCost, TComTrQuant, TRenSingleModelC, TEncSearch, TComPrediction, TComPattern, partialButterfly*, TEncSbac, TComDataCU, TComInterpolationFilter, TRenModel, TComYuv) onto the blocks of the hybrid 3D-HEVC encoder/decoder diagram (coder control; transform, scaling and quantization; intra-picture estimation and prediction; inter/intra selection; motion compensation; deblocking and SAO filters; image rendering; decoded picture buffers; header formatting and CABAC).]

Chapter 4
High-Level Synthesis
This chapter presents an analysis of HLS techniques, HLS tools and current HLS
research topics. Academia and industry are working together on the automatic
design of customized application-specific hardware accelerators. Three academic
tools are considered, the Delft workbench automated reconfigurable VHDL generator
(DWARV) [60], BAMBU [61] and LEGUP [62], alongside many commercially available
tools. Many research challenges are still open in the HLS domain.
An overview of HLS tools is presented in this section. As shown in Figure 4.1,
the HLS tools are classified by their design input language into two categories:
tools that accept general-purpose languages (GPLs) and tools that accept
domain-specific languages (DSLs) as input. The DSL tools are further split into
tools built on GPL-based dialects and tools built for a specific tool-flow.
Within each category, red, blue and green fonts are used for the tools: red shows
that the tool is obsolete, blue represents N/A, i.e. no information about the
usability status of the tool, and green shows that the tool is still in use. The
figure legend shows the application areas of the tools. The use of SystemC or
DSLs as input language increases the chances of a tool's adoption by software
developers.
In the following paragraphs we briefly describe the available commercial and
academic HLS tools. Information regarding the target application domains, the
automatic generation of test benches and the support for fixed- and
floating-point arithmetic can be found in [63].
Garp: the main aim of this project was the acceleration of loops of
general-purpose (GP) software [71].
Napa-C: developed at Stanford University, this was the first project to consider
high-level synthesis based compilation for systems containing both a
microprocessor and configurable logic [72].
Catapult-C: an HLS tool initially developed for ASICs; it is now used for both
ASICs and FPGAs [75].
C-to-Silicon (CtoS): developed by Cadence and used for both dataflow and control
applications; SystemC is the input language [76].
Trident: produces VHDL-based hardware accelerators for floating-point
applications [80].
C2H: a technology-dependent tool targeting hardware accelerator units based on
the Altera soft processor and the Avalon bus [81].
This optimization performs operation scheduling with respect to the specified
clock period: within a single clock cycle, two combinational operators can be
chained together, removing false paths [92].
Bit-width optimization reduces the number of bits needed by the data-path
operators. It impacts all the non-functional requirements, such as power, area
and performance, without affecting the behaviour of the design.
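As a minimal illustration using Vivado HLS's arbitrary-precision types from ap_int.h (the function name and the 64-pixel block size are our own assumptions):

```cpp
#include "ap_int.h"   // Vivado HLS arbitrary-precision integer types

// Sum of 64 unsigned 8-bit pixels. The maximum possible value is
// 64 * 255 = 16320, which fits in 14 bits, so a 14-bit accumulator
// replaces the 32-bit datapath a plain 'int' would synthesize.
ap_uint<14> block_sum(const ap_uint<8> px[64]) {
    ap_uint<14> acc = 0;
    for (int i = 0; i < 64; i++)
        acc += px[i];
    return acc;
}
```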
Distributed block RAMs (BRAMs) are present in FPGAs in the form of multiple
memory banks. This structure supports the partitioning and mapping of the
software data structures onto the banks, making fast memory accesses
implementable at minimum cost. On the other hand, the memory ports of these
elements are very limited; configuring and customizing the memory accesses may
require building an efficient, optimized multi-bank architecture in order to
reduce the performance limitation [93].
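A minimal Vivado HLS sketch of the multi-bank idea, assuming an illustrative 256-element array: the ARRAY_PARTITION directive splits one BRAM-mapped array into independent banks so that several elements can be read in the same cycle.

```cpp
// Partitioning 'buf' cyclically by a factor of 4 creates four banks, so
// buf[i], buf[i+1], buf[i+2] and buf[i+3] live in different banks and can
// be read concurrently; a single dual-port BRAM would allow at most two
// accesses per cycle.
int sum256(const int buf[256]) {
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4 dim=1
    int acc = 0;
    for (int i = 0; i < 256; i += 4) {
#pragma HLS PIPELINE II=1
        acc += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
    }
    return acc;
}
```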
Loops are the compute-intensive parts of algorithms, so hardware acceleration of
algorithms with compute-intensive loops is significantly important, and loop
pipelining is the major performance optimization for the hardware implementation
of loops. This optimization exploits loop-level parallelism: if the data
dependencies allow it, a new loop iteration is started before its predecessor
finishes. The idea of loop pipelining is related to software pipelining [94], a
concept already used by very long instruction word (VLIW) processors. To fully
exploit the parallelism, the combination of loop pipelining and a multi-bank
architecture is frequently used [93].
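The effect can be sketched with the Vivado HLS PIPELINE directive (the dot-product kernel is an illustrative example, not code from this thesis):

```cpp
// Multiply-accumulate over N samples. Rolled, the loop costs roughly
// N * (body latency) cycles; pipelined with II=1 it costs about
// (body latency) + N - 1 cycles, because iterations overlap.
const int N = 64;

int dot(const short a[N], const short b[N]) {
    int acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    return acc;
}
```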
In HLS, knowing how each operation will be implemented is essential to meet the
timing requirements and to minimize resource usage. The front-end phase of the
HLS flow first inspects the given behavioural specification and identifies the
characteristics of its operations, e.g. operand type (float or integer),
operation type (arithmetic or non-arithmetic) and bit-width. Some operations
benefit from specific optimizations: for example, divisions and multiplications
by a constant value are transformed into add and shift operations [95], [96] to
improve timing and area. The resulting timing and resources of the circuit are
heavily impacted by this methodology, so for efficient HLS the composition of
this type of library is very crucial.
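As a concrete instance of this transformation, the x20 and x5 constants of the interpolation filters used later in this thesis reduce to shifts and adds (a sketch of the rewrite, not tool output):

```cpp
// 20*x = 16*x + 4*x and 5*x = 4*x + x: each product becomes two shifts
// and an add, so no DSP multiplier is needed.
static inline int times20(int x) { return (x << 4) + (x << 2); }
static inline int times5(int x)  { return (x << 2) + x; }
```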
4.3.8 If-Conversion
As stated earlier, AutoPilot [90] was initially developed by AutoESL; Xilinx
acquired AutoESL in 2011 and AutoPilot became Vivado HLS [91], which is based on
LLVM and was released in early 2013. This improved product includes a complete
design environment with rich features for generating fine-tuned HDL from an HLL.
It accepts C++, C and SystemC as input and generates hardware modules in Verilog,
VHDL and SystemC. At compilation time it offers the possibility of applying
various optimizations, such as loop unrolling, operation chaining and loop
pipelining; furthermore, memory-specific optimizations can be applied. To
simplify accelerator integration, both shared and streaming memory interfaces
are supported.
The flow adopted by Vivado HLS is the transformation of C code into an RTL
implementation in Verilog or VHDL, followed by synthesis of the generated HDL
code onto a Xilinx FPGA. The input code can be written in C++, C, SystemC or
OpenCL.
Software and hardware domains can be bridged through high-level synthesis, which
offers a number of primary benefits. In HLS, C code is synthesized through the
following phases:
Scheduling: the operations are assigned to clock cycles.
Binding: the scheduled operations are mapped onto hardware resources.
Control logic extraction: to sequence the operations in the RTL design, i.e. to
generate the finite state machine (FSM), the control logic is extracted.
The synthesized design is characterized by the following performance metrics:
Latency: the number of clock cycles required to compute all output values.
Initiation interval (II): the number of clock cycles before new inputs can be
accepted.
Loop iteration latency: the number of clock cycles required to complete a single
iteration of a loop.
Loop initiation interval: the number of clock cycles before the next iteration
of the loop starts.
Loop latency: the number of clock cycles required to execute all iterations of a
loop.
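These quantities are related in the usual way for a pipelined loop; ignoring entry/exit overhead, a loop with N iterations, iteration latency L and initiation interval II completes in approximately:

```latex
\text{loop latency} \approx L + II \times (N - 1)
```

For example, a 64-iteration loop with L = 4 and II = 1 finishes in about 67 cycles, whereas the unpipelined version (II = L) needs about 256.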
In addition to the C source files, Vivado HLS takes the following as input:
Directives
Constraints
Its primary outputs are HDL-based RTL implementation files (the supported RTL
formats are Verilog, VHDL and SystemC) and report files.
An overview of the input and output files of Vivado HLS is shown in Fig. 4.2.
The top-level function of any C program is called main(). In Vivado HLS, any
sub-function can be specified as the top-level function for synthesis; main()
itself cannot be synthesized. Additional rules are as follows:
Test Bench
Language Support
C Libraries
For FPGA implementation, Vivado HLS contains optimized C libraries; a high
quality of results (QoR) is achieved by using these libraries. In addition to
the standard C language libraries, Vivado HLS provides extended support for the
following C libraries:
Video functions
Math operations
Maximized usage of shift register LUT (SRL) resources using FPGA resource
functions
A Vivado HLS project can hold multiple solutions for a given set of C code.
Different optimizations and constraints can be applied in each solution, and the
results of the solutions can be compared in the Vivado HLS GUI.
The steps involved in the Vivado HLS design process, i.e. synthesis, optimization
and analysis, are listed as follows:
3. Design synthesis.
4. Results analysis.
If the analysis of the results shows that the design does not meet the
requirements, a new solution can be created and synthesized with new optimization
directives and constraints. The process can be repeated until the design
performance and requirements are met. The advantage of multiple solutions is
being able to move development forward while still retaining the old results.
Optimization
Task pipelining.
Latency specification.
Analysis
In Vivado HLS, the results can be analysed using the Analysis Perspective; its
performance tab allows the synthesis results to be analysed.
RTL Verification
ModelSim simulator
RTL Export
The final RTL output files can be exported as an IP package for the Xilinx
Vivado Design Suite. The supported IP formats are listed as follows:
For use in the Embedded Development Kit (EDK) and for import into Xilinx
Platform Studio (XPS): Pcore
For import directly into the Vivado Design Suite: Synthesized Checkpoint (.dcp)
Chapter 5
HLS Based FPGA Implementation of Interpolation Filters
Video processing systems are becoming more complex, decreasing the productivity
of hardware designers and software programmers and producing a design
productivity gap. To fill this gap, the hardware and software fields are bridged
through High Level Synthesis (HLS), thus improving the productivity of hardware
designers. One of the most computationally intensive parts of the High Efficiency
Video Coding (HEVC) and H.264/AVC video coding standards is the interpolation
filtering used for sub-pixel interpolation. In this chapter, we present an
HLS-based FPGA implementation of the sub-pixel luma and chroma interpolation of
HEVC and of the sub-pixel luma interpolation of H.264/AVC. The Xilinx Vivado
Design Suite is used for the FPGA implementation of the interpolation filtering
on the Xilinx xc7z020clg481-1 device. The resulting designs achieve a frame
processing speed of 41 QFHD (i.e. 3840x2160 @ 41 fps) for H.264/AVC luma
sub-pixel interpolation, 46 QFHD for HEVC luma sub-pixel interpolation and
48 QFHD for HEVC chroma interpolation. The development time is significantly
decreased by the HLS tools.
For 4:2:0 colour format video in H.264/AVC, luma sampling supports quarter-pixel
accuracy and chroma sampling supports one-eighth-pixel accuracy of the motion
vectors [6]. A motion vector may point to an integer and/or fractional sample
position; in the latter case, the fractional pixels are generated by
interpolation. A one-dimensional 6-tap FIR filter produces the prediction signal
at the half-sample positions, in the vertical and horizontal directions. The
quarter-sample values of the prediction signal are generated by averaging the
sample values at full- and half-pixel positions.
The luma sub-pixel interpolation process of H.264/AVC is shown in Fig. 5.1. The
half-pixel values b0,0 and h0,0 are obtained by applying the 6-tap filter in the
horizontal and vertical directions, respectively, as follows:

b_{0,0} = (A_{-2,0} - 5A_{-1,0} + 20A_{0,0} + 20A_{1,0} - 5A_{2,0} + A_{3,0} + 16) >> 5    (5.1)

h_{0,0} = (A_{0,-2} - 5A_{0,-1} + 20A_{0,0} + 20A_{0,1} - 5A_{0,2} + A_{0,3} + 16) >> 5    (5.2)

The centre half-pixel j0,0 is obtained from the intermediate (unrounded)
horizontal half-pixel sums b_n, or equivalently from the intermediate vertical
half-pixel sums h_n:

b_n = b_{n,-2} - 5b_{n,-1} + 20b_{n,0} + 20b_{n,1} - 5b_{n,2} + b_{n,3}    (5.3)

j_{0,0} = (b_n + 512) >> 10    (5.4)

h_n = A_{-2,0} - 5A_{-1,0} + 20A_{0,0} + 20A_{1,0} - 5A_{2,0} + A_{3,0}    (5.5)

j_{0,0} = (h_{n,-2} - 5h_{n,-1} + 20h_{n,0} + 20h_{n,1} - 5h_{n,2} + h_{n,3} + 512) >> 10    (5.6)
The quarter-pixel samples are calculated by averaging the nearest half-pixel
and/or integer-pixel values; the samples used in the averaging can be two
half-pel samples or a combination of half-pel and integer-pel samples.
As an example, the following equations show how to calculate the quarter-pixel
samples for some of the quarter-pixel positions, i.e. a0,0, f0,0 and e0,0 out of
a0,0, c0,0, d0,0, n0,0, f0,0, i0,0, k0,0, q0,0, e0,0, g0,0, p0,0 and r0,0:
Fig. 5.1 Pixel positions for integer, luma half and luma quarter pixels.
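The half-pel equations translate directly into C; a minimal sketch follows, with our own function names, and with the quarter-pel averaging shown for one position (in the thesis's notation, a0,0 = (A0,0 + b0,0 + 1) >> 1):

```cpp
#include <cstdint>

// Clip a filtered value to the 8-bit pixel range.
static inline uint8_t clip255(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// 6-tap half-pel filter of Eq. (5.1)/(5.2): taps (1,-5,20,20,-5,1),
// rounding offset 16, right shift by 5.
static inline uint8_t halfpel6(int m2, int m1, int z0,
                               int p1, int p2, int p3) {
    int sum = m2 - 5 * m1 + 20 * z0 + 20 * p1 - 5 * p2 + p3;
    return clip255((sum + 16) >> 5);
}

// Quarter-pel by averaging a full-pel and a half-pel neighbour,
// e.g. a0,0 = (A0,0 + b0,0 + 1) >> 1; other positions follow the
// same pattern with different neighbour pairs.
static inline uint8_t quarterpel(uint8_t full, uint8_t half) {
    return (uint8_t)((full + half + 1) >> 1);
}
```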
In our proposed design, 13x13 integer pixels are used for the half- and
quarter-pixel interpolation of an 8x8 PU, as shown in Fig. 5.2. The proposed HLS
implementation of H.264/AVC luma sub-pixel interpolation is shown in Fig. 5.3.
For larger PU sizes, the half and quarter pixels can be interpolated by dividing
the larger block into 8x8 PU parts and processing each part. 13 integer pixels
are given as input to the first half-pixel interpolator array hpi1 in each clock
cycle.
Fig. 5.2 13x13 Pixel Grid for H.264/AVC Luma Interpolation of 8x8 block (where green
colour represents the integer pixels block to be interpolated and yellow colour represents the
required integer pixels padded to the block to support interpolation).
Eight b0,0 half-pixels are computed in parallel in each clock cycle, so in total
the array interpolates 13x8 half-pixels in 13 clock cycles. These half-pixels are
stored in registers for the interpolation of the half-pixels j0,0 and the
quarter-pixels a0,0 and c0,0. During the interpolation of the b0,0 half-pixels,
the 13x13 integer pixels are stored for the half-pixel interpolation of h0,0.
The h0,0 half-pixels are then interpolated from these stored 13x13 integer pixels
using hpi1 while, in parallel, the j0,0 half-pixels are interpolated by hpi2 from
the already available intermediate b0,0 half-pixels. The half-pixels h0,0 and
j0,0 are also stored in registers for the quarter-pixel interpolation. Finally,
all the a0,0, c0,0, d0,0, n0,0, f0,0, i0,0, k0,0, q0,0, e0,0, g0,0, p0,0 and
r0,0 quarter-pixels are generated from the already computed registered
half-pixels b0,0, h0,0 and j0,0 and the 13x13 integer pixels.
The Vivado Design Suite is used for the HLS-based FPGA implementation of the
design, and the HLS design is synthesized to Verilog RTL. Vivado HLS takes C,
C++ or SystemC code as input; in our case, C code written according to the
H.264/AVC reference software video encoder is given as input to the Vivado HLS
tool. Vivado HLS provides various optimization techniques called optimization
directives, or pragmas. Many variants of the HLS implementation of H.264/AVC
luma sub-pixel interpolation are possible, depending on the area vs. performance
trade-off requirements; the Design Space Exploration (DSE) of the H.264/AVC luma
sub-pixel interpolation is carried out using these optimization directives.
Discussion on Results
Vivado HLS keeps loops rolled by default: a loop is treated as a single sequence
of the operations defined within its body, all of those operations are
synthesized once as hardware, and all iterations of the loop share this same
hardware. The loop UNROLL directive of Vivado HLS unrolls loops partially or
fully, depending on the application requirements. If the application is
performance-critical, the UNROLL directive can be used to obtain hardware that
is better optimized for performance through parallel processing, at the cost of
area: if a loop is fully unrolled, multiple copies of the same hardware are
synthesized. The other directive we used in our design is PIPELINE. The PIPELINE
directive can be applied to a function or a loop and implements pipelining: new
inputs can be processed every N clock cycles, where N is the Initiation Interval
(II), i.e. the number of clock cycles after which the design accepts new inputs.
When the PIPELINE directive is applied, all loops within the scope of the
pipelined region are automatically unrolled, so the UNROLL directive does not
need to be applied separately to a loop in that scope. For parallel processing,
the data availability requirements must also be satisfied. In our design, arrays
are used as input to the HLS tool. By default, Vivado HLS maps arrays to block
RAMs, which allow only one read or write per port per cycle (two accesses with a
dual-port block RAM). The ARRAY PARTITION directive is therefore used to
partition the arrays into individual registers, which makes the data available
for parallel processing (see the sketch below).
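A sketch of how these directives combine in a row interpolator of the kind described above (the function shape is illustrative; it is not the exact code of hpi1):

```cpp
// The 13-pixel input row is partitioned completely into registers, and
// PIPELINE unrolls the inner loop, so the eight half-pixels of one row
// are produced per clock cycle.
void hpi_row(const unsigned char row[13], unsigned char half[8]) {
#pragma HLS ARRAY_PARTITION variable=row complete dim=1
#pragma HLS ARRAY_PARTITION variable=half complete dim=1
#pragma HLS PIPELINE II=1
    for (int i = 0; i < 8; i++) {   // unrolled automatically by PIPELINE
        int sum = row[i] - 5 * row[i + 1] + 20 * row[i + 2]
                + 20 * row[i + 3] - 5 * row[i + 4] + row[i + 5];
        int v = (sum + 16) >> 5;
        half[i] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
```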
Two different implementations of the H.264/AVC luma sub-pixel interpolation are
carried out using two different techniques for the constant multiplications:
multiplication using multipliers, and multiplication using add and shift
operations.
Table 5.1 Resources required for HLS implementation of H.264/AVC Luma Sub-pixel
Interpolation using multipliers for multiplication.
Optimization BRAM18K DSP48E FF LUT SLICE Freq. (MHz) Clock Cycles Fps
NO OPTIMIZATION 0 0 706 1188 430 128 1489 0.5
LOOP UNROLL 0 0 3011 5084 1670 110 577 1.5
LOOP UNROLL + ARRAY PARTITION 0 0 3451 8653 2655 112 473 2
PIPELINE + ARRAY PARTITION 0 0 10224 27995 8817 102 19 41
Table 5.2 Resources required for HLS implementation of H.264/AVC Luma Sub-pixel
Interpolation using add and shift operations for multiplication.
Optimization BRAM18K DSP48E FF LUT SLICE Freq. (MHz) Clock Cycles Fps
NO OPTIMIZATION 0 0 302 413 110 129 1432 1
LOOP UNROLL 0 0 2304 3033 422 212 577 3
LOOP UNROLL + ARRAY PARTITION 0 0 2843 4056 748 210 449 3
PIPELINE + ARRAY PARTITION 0 0 11001 12774 2606 102 19 41
Tables 5.1 and 5.2 list the optimization directives used and the corresponding
hardware resources required for the HLS implementations using multipliers for
the constant multiplications and using shift and add operations, respectively.
Mainly three directives are used for the efficient implementation of the
H.264/AVC luma interpolation designs. As shown in Tables 5.1 and 5.2, when no
optimization directive is applied the latency is much higher, i.e. processing an
8x8 PU takes more clock cycles than in the optimized designs. For the optimized
designs we use combinations of optimization directives, namely LOOP UNROLL +
ARRAY PARTITION and PIPELINE + ARRAY PARTITION. In both designs, applying the
optimizations shows a significant area vs. performance trade-off. With constant
multiplication implemented by add and shift operations, we obtain the design
that is better optimized in terms of both area and performance.
Table 5.3 compares the HLS and manual RTL implementations of H.264/AVC luma
sub-pixel interpolation. It is evident that the HLS implementation is more
efficient in terms of performance. Even though the other two implementations are
VLSI based, we expect similar performance for the FPGA implementations of the
corresponding designs.
Table 5.4 compares the HLS implementations of H.264/AVC and HEVC luma sub-pixel
interpolation. The proposed implementation takes less area than the HLS
implementation of HEVC, because HEVC uses larger interpolation filters and hence
more area. The throughput of the HEVC luma interpolation is higher because its
quarter-pixel interpolation is independent of the half-pixel interpolation,
e.g. a0,0, b0,0, d0,0 and h0,0.
The upper-case letters Ai,j within the yellow blocks in Fig. 5.1 represent luma
sample positions at full-pixel locations; these integer pixels at full-pixel
locations can be used for the prediction of the fractional luma sample values.
The white blocks with lower-case letters, e.g. a0,0, b0,0, represent luma sample
positions at quarter-pixel locations. The fractional luma sample positions are
computed by Equations (5.10)-(5.24). The fractional luma sample values a0,0,
b0,0, c0,0, d0,0, h0,0 and n0,0 are computed by applying the 7- and 8-tap
interpolation filters to the integer pixel values, as specified by Equations
(5.10)-(5.15). The quarter-pixel values e0,0, i0,0, p0,0, f0,0, j0,0, q0,0,
g0,0, k0,0 and r0,0 are computed by applying the 7- and 8-tap filters in the
vertical direction to the already computed values of a0,i, b0,i and c0,i, where
i = -3..4, as shown by Equations (5.16)-(5.24).
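The text of Equations (5.10)-(5.24) is not reproduced here, but the filters themselves are fixed by the HEVC standard; a sketch with the published luma coefficient sets follows (the table layout and function name are our own):

```cpp
// HEVC luma interpolation filter taps (as published in the HEVC
// specification); row 1 = quarter-pel (7 taps), row 2 = half-pel
// (8 taps), row 3 = three-quarter-pel.
static const int kLumaTaps[4][8] = {
    { 0, 0,   0, 64,  0,   0, 0,  0},   // integer position (identity)
    {-1, 4, -10, 58, 17,  -5, 1,  0},   // a: 1/4
    {-1, 4, -11, 40, 40, -11, 4, -1},   // b: 1/2
    { 0, 1,  -5, 17, 58, -10, 4, -1},   // c: 3/4
};

// One-dimensional filtering over p[-3..4] at fractional position
// 'frac' (0..3); the filters have a gain of 64, removed by the rounded
// shift. Note: for the two-stage positions (e0,0, f0,0, ...) the
// standard keeps higher intermediate precision between the stages.
static inline int hevc_luma_filter(const unsigned char *p, int frac) {
    int sum = 0;
    for (int k = -3; k <= 4; k++)
        sum += kLumaTaps[frac][k + 3] * p[k];
    return (sum + 32) >> 6;
}
```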
In our proposed design, 15x15 integer pixels are used for the half- and
quarter-pixel interpolation of an 8x8 PU, as shown in Fig. 5.4. The proposed
HLS-based implementation of HEVC luma sub-pel interpolation is shown in Fig. 5.5.
For larger PU sizes, the half and quarter pixels can be interpolated by dividing
the larger block into 8x8 PU parts and processing each part. 15 integer pixels
are given as input to the array of sub-pixel interpolator filters, i.e.
FilterSetabc1-FilterSetabc8, in each clock cycle.
Fig. 5.4 15x15 Pixel Grid for HEVC Luma Interpolation of 8x8 block (where green colour
represents the integer pixels block to be interpolated and yellow colour represents the required
integer pixels padded to the block to support interpolation).
each clock cycle. 24 sub-pixels i.e. 8a, 8b, 8c are computed in parallel in each
clock cycle, so in total it will interpolate 15x24 half pixels in 15 clock cycles. These
half pixels are stored into registers for computing the half pixels e.g. e0,0 , f0,0 , j0,0
etc. 15x15 integer pixels are stored for the half pixel interpolation of the a0,0 , b0,0 , c0,0
etc. Then the d0,0 , h0,0 , n0,0 half pixels are interpolated using these stored 15x15
integer pixels using the same filter set. Finally the half pixels e.g. e0,0 , f0,0 , j0,0 etc.
are interpolated using the already stored half pixels a0,0 , b0,0 , c0,0 .
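The parallelism just described maps directly onto the HLS directives of Tables 5.5 and 5.6. The fragment below is a simplified sketch of one slice of the filter array (the name FilterSetabc and the reuse of filter_luma from the previous sketch are illustrative): one 15-pixel row enters per clock cycle and 8 a, 8 b and 8 c sub-pixels leave in parallel.

// Simplified sketch: one row of 15 padded integer pixels produces
// 8 a, 8 b and 8 c sub-pixels per clock cycle.
void FilterSetabc(const unsigned char row[15], int a[8], int b[8], int c[8]) {
#pragma HLS ARRAY_PARTITION variable=row complete  // all pixels readable in parallel
#pragma HLS PIPELINE II=1                          // accept one row per clock cycle
    for (int i = 0; i < 8; i++) {
#pragma HLS UNROLL                                 // 8 filter instances in parallel
        a[i] = filter_luma(&row[i], QA);
        b[i] = filter_luma(&row[i], HB);
        c[i] = filter_luma(&row[i], QC);
    }
}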
Table 5.5 Resources required for HLS based HEVC luma implementation using multipliers
for multiplication.
Optimization BRAM18K DSP48E FF LUT SLICE Freq. (MHz) Clock Cycles Fps
NO OPTIMIZATION 0 0 1221 1845 718 218 1505 1
LOOP UNROLL 0 0 4132 14031 2167 150 190 6
LOOP UNROLL + ARRAY PARTITION 0 0 8201 17215 3356 165 130 10
PIPELINE + ARRAY PARTITION 0 0 11490 29534 9315 165 59 21
Table 5.6 Resources required for HLS based HEVC luma implementation using add and shift
operations for multiplication.
Optimization BRAM18K DSP48E FF LUT SLICE Freq. (MHz) Clock Cycles Fps
NO OPTIMIZATION 0 0 719 1243 325 210 966 2
LOOP UNROLL 0 0 3203 8598 1388 180 190 7
LOOP UNROLL + ARRAY PARTITION 0 0 6456 12766 1890 165 88 14
PIPELINE + ARRAY PARTITION 0 0 11122 14452 3477 165 28 46
Table 5.7 gives the comparison between the HLS and manual RTL implementations of HEVC luma sub-pixel interpolation. It is evident that the HLS implementation is more efficient in terms of performance. Even though the other three implementations are VLSI based, we expect similar performance from FPGA implementations of the corresponding designs.
The upper-case letters Bi,j within the shaded blocks in Fig. 5.6 represent chroma sample positions at full-pixel locations. For the prediction of fractional chroma sample values, these integer pixels at full-pixel locations can be used. Un-shaded blocks with lower-case letters, e.g. ab0,0, ac0,0, represent the chroma sample positions at eighth-pixel locations.
Fractional chroma sample positions are computed by Equations (5.26)–(5.32). The fractional chroma sample values ab0,0, ac0,0, ad0,0, ae0,0, af0,0, ag0,0 and ah0,0 are computed by applying 4-tap interpolation filters to the integer pixel values, as specified by Equations (5.26)–(5.32). The fractional chroma sample values ba0,0, ca0,0, da0,0, ea0,0, fa0,0, ga0,0 and ha0,0 are computed by applying 4-tap interpolation filters to the integer pixel values, as specified by Equations (5.33)–(5.39).
Fig. 5.6 Integer chroma sample positions Bi,j and fractional chroma sample positions ab0,0 to hh0,0 at eighth-pixel locations.
The fractional chroma sample values bV0,0, cV0,0, dV0,0, eV0,0, fV0,0, gV0,0 and hV0,0, for V being replaced by b, c, d, e, f, g and h, respectively, are computed by applying 4-tap interpolation filters to the intermediate values aV0,i with i = -1..2 in the vertical direction, as given by Equations (5.40)–(5.46). The value of the variable shift1 is given by Equation (5.47), while shift2 = 6.
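As a companion to Equations (5.26)–(5.46), the sketch below lists the eight 4-tap chroma coefficient sets defined by the HEVC standard, one per eighth-pel offset, together with a single filtering step; the names are illustrative.

// HEVC chroma interpolation: 4-tap coefficients for the eighth-pel
// fractional offsets 0..7 (offset 0 is the integer position).
static const int CF[8][4] = {
    {  0, 64,  0,  0 },   // 0
    { -2, 58, 10, -2 },   // 1/8
    { -4, 54, 16, -2 },   // 2/8
    { -6, 46, 28, -4 },   // 3/8
    { -4, 36, 36, -4 },   // 4/8 (half-pel)
    { -4, 28, 46, -6 },   // 5/8
    { -2, 16, 54, -4 },   // 6/8
    { -2, 10, 58, -2 }    // 7/8
};

// One 4-tap step over neighbouring samples B[-1..2]; the result is shifted
// by shift1 in the horizontal pass and by shift2 = 6 in the vertical pass.
int filter_chroma(const unsigned char B[4], int frac) {
    return CF[frac][0]*B[0] + CF[frac][1]*B[1]
         + CF[frac][2]*B[2] + CF[frac][3]*B[3];
}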
In our proposed design, 7x7 integer pixels are used for the eighth-pixel interpolation of a 4x4 chroma PU, as shown in Fig. 5.7. In Fig. 5.8 the proposed HLS based implementation of HEVC chroma sub-pel interpolation is shown. For larger PU sizes, eighth-pixels can be interpolated by processing each 4x4 part of the larger block, i.e. by dividing the larger block into PUs of size 4x4. In each clock cycle, 7 integer pixels are given as input to the array of sub-pixel interpolation filters, i.e. FilterSetbcdefgh1–FilterSetbcdefgh4. 28 sub-pixels, i.e. four samples for each of the seven fractional positions b, c, d, e, f, g and h, are computed in parallel in each clock cycle, so in total 7x28 sub-pixels are interpolated in 7 clock cycles. These sub-pixels are stored into registers for computing the sub-pixels bb0,0, bc0,0, bh0,0, etc. The 7x7 integer pixels used for the sub-pixel interpolation of bb0,0, bc0,0, bh0,0, etc. are also stored. Then the ba0,0–ha0,0 sub-pixels are interpolated from these stored 7x7 integer pixels using the same filter set. Finally the sub-pixels bb0,0, bc0,0, bh0,0, etc. are interpolated using the already stored sub-pixels ab0,0–ah0,0.
Fig. 5.7 7x7 Pixel Grid for HEVC Chroma Interpolation of 4x4 block (where green colour represents the integer-pixel block to be interpolated and yellow colour represents the required integer pixels padded to the block to support interpolation).
Table 5.8 Resources required for HLS based HEVC chroma implementation using multipliers
for multiplication.
Optimization BRAM18K DSP48E FF LUT SLICE Freq. (MHz) Clock Cycles Fps
NO OPTIMIZATION 0 0 615 1020 313 220 1424 2
LOOP UNROLL 0 0 2144 7010 1077 162 178 7
LOOP UNROLL + ARRAY PARTITION 0 0 4671 9156 1796 169 116 11
PIPELINE + ARRAY PARTITION 0 0 6723 15884 5358 169 53 24
Table 5.9 Resources required for HLS based HEVC chroma implementation using add and
shift operations for multiplication.
Optimization BRAM18K DSP48E FF LUT SLICE Freq. (MHz) Clock Cycles Fps
NO OPTIMIZATION 0 0 321 654 325 215 622 3
LOOP UNROLL 0 0 1603 8598 745 176 145 9
LOOP UNROLL + ARRAY PARTITION 0 0 3288 12766 967 169 79 17
PIPELINE + ARRAY PARTITION 0 0 5986 14452 1752 169 27 48
Tables 5.8 and 5.9 list the optimization directives used and the corresponding hardware resources required for the HLS implementations, using multipliers for the constant multiplications and using shift and add operations, respectively. Mainly three directives are used for the efficient implementation of the HEVC chroma interpolation designs. As shown in Tables 5.8 and 5.9, when no optimization directive is applied the latency is much higher, i.e. processing a 4x4 PU takes many more clock cycles than in the optimized designs. For the optimized designs we use combinations of optimization directives, namely LOOP UNROLL + ARRAY PARTITION and PIPELINE + ARRAY PARTITION. In both designs the application of the optimizations shows a significant area vs. performance trade-off. With the constant multiplications implemented by add and shift operations, the optimized design is better in terms of both area and performance.
Table 5.10 gives the comparison between the HLS and manual RTL implementations of HEVC chroma sub-pixel interpolation. It is evident that the HLS implementation is more efficient in terms of performance. Even though the other implementation is VLSI based, we expect similar performance from an FPGA implementation of the corresponding design.
In this work, a hardware implementation of the HEVC interpolation filters and of the H.264/AVC luma interpolation based on the HLS design flow is presented. The throughput of the HLS accelerator is 41 QFHD, i.e. 3840x2160@41fps, for H.264/AVC luma sub-pixel interpolation, 46 QFHD for HEVC luma sub-pixel interpolation and 48 QFHD for HEVC chroma interpolation. It achieves almost the same performance as the manual implementation with half of the development time. A comparison of the Xilinx Vivado HLS based implementation with already available RTL-style hardware implementations built using Xilinx System Generator is presented. Both hardware designs were implemented with the Xilinx Vivado Suite as stand-alone cores on a Xilinx Virtex 7 xc7z020clg481-1 device. By optimizing and refactoring the C algorithmic model, we were able to implement a design that has almost the same performance as the reference implementation. The design time required for the HLS based implementation is significantly lower, about half of the design time required by the manual RTL implementation.
Our design is semi-automated through the Vivado HLS tools, while the other designs were developed using the typical manual RTL Verilog/VHDL design flow. The design time for Xilinx Vivado HLS was extracted from the version-control log of the source code. This time reflects the design time needed by a designer who is an expert in the tool, not a domain expert. This means that the designer takes unfamiliar code, uses the tool to implement it as a first design, and then refactors the code to obtain the desired optimized architecture, discovering the RTL improvements made to the algorithm by reverse engineering the original RTL code and performing design-space exploration. Originally, the software model of the algorithm did not include those improvements. The throughput in terms of fps of the manual design is almost the same as that of the HLS design.
For comparison purposes we designed the same interpolation filters using both design methodologies. The breakdown of each methodology, for the purpose of evaluating and comparing development times, is shown in Fig. 5.9. For the same hardware functionality the two design flows have different design steps with different development times. The Vivado HLS based design allows implementing the architecture in about 3 months, whereas the manual design takes about 6 months to reach the final hardware design. A trade-off between performance and design time is observed in both methodologies: the manual design flow takes almost double the design time of the HLS design flow.
Fig. 5.9 Breakdown of the two design methodologies: the manual flow (algorithm study, filtering-order and process definition, micro-architectural design of the interpolation filter, RTL design, simulation, area/timing optimization, verification and final design testing) and the HLS flow (interpolation filter familiarization, C/C++ adaptation of the code, optimization and HLS code refinement, high-level synthesis and code generation, C simulation and C/RTL co-simulation, and final design testing).
Thus we were able to draw the conclusion that High-Level Synthesis (HLS) tools provide shorter development times by automatically generating the hardware implementations while the designer keeps working at the higher abstraction level of software [112]. Furthermore, the HLS design flow allows a large design-space exploration of different performance and area trade-offs. The HLS flow is also efficient in terms of simulation, documentation, design, coding, debugging and reuse. In contrast, a manual design needs a long time to generate a single hardware architecture, which requires further redesign effort if timing or area constraints are not fulfilled.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
Time-to-market is a critical factor for digital systems, and this increases the demand for FPGA platforms, which help to avoid manufacturing cycles and chip design time. For designers, a design-time reduction at the cost of increased power, reduced performance or higher cost is acceptable. The latest HLS tools give designers the option of considering HLS for hardware implementation, thanks to a significant design-time reduction and a Quality-of-Results (QoR) comparable to manual RTL design.
For the verification and testing steps, a manual design needs more time and effort, since the possibility of errors is considerable in this kind of design, unlike automated design flows, which minimize error-prone steps. In the proposed HLS design the generation of the RTL code is automated, but the interface between processor and accelerator must still be configured manually.
Verification productivity is a key factor in the adoption of high-level synthesis methodologies for hardware implementation. Design simulation at a high level, i.e. at the C level, is much faster than simulation of the RTL implementation. This does not mean that RTL verification is no longer needed; rather, it means that the design time can be significantly reduced by shortening the verification-debug cycles, which in RTL take a large share of the design time. HLS users can achieve an almost two-fold improvement in the verification process with almost the same design performance.
6.2 Future Work
Depth maps are used for virtual view synthesis, and the distortion introduced by depth-map coding affects the quality of the synthesized views. The Synthesized View Distortion Change (SVDC) due to the coding of the current depth block is given by
$$SVDC = D_{distCB} - D_{orgCB} = \sum_{(x,y)\in S}\left[S_{T,distCB}(x,y) - S_{T,Ref_{org}}(x,y)\right]^2 - \sum_{(x,y)\in S}\left[S_{T,orgCB}(x,y) - S_{T,Ref_{org}}(x,y)\right]^2 \qquad (6.1)$$
where $S_{T,Ref_{org}}$ is the reference view synthesized from the original input left and right textures $S_{T,l}$, $S_{T,r}$ and depth maps $S_{D,l}$, $S_{D,r}$. $S_{T,orgCB}$ and $S_{T,distCB}$ are the views synthesized using the original $S_{orgCB}$ (indicated in yellow) and distorted $S_{distCB}$ (indicated in red) data of the current depth block, as shown in Fig. 6.1. T, D, l, r, x, y and S denote texture, depth, left, right, the horizontal and vertical pixel components and the set of pixels in the synthesized view, respectively. $D_{distCB}$ and $D_{orgCB}$ are the distortions when using the distorted and the original current depth block, respectively; the SVDC is the difference between them. As an example, in Fig. 6.1 the current depth frame is divided into four depth blocks B0, B1, B2 and B3, where B0 and B1 are already coded blocks, B2 is the current block to be coded and B3 is the block to be coded after the coding of B2 is finished. The SSD unit computes the Sum of Squared Differences between $S_{T,Ref_{org}}$ and each of $S_{T,distCB}$ and $S_{T,orgCB}$.
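A direct software rendering of Equation (6.1) can serve as a golden model for a future hardware SSD unit; the sketch below assumes flat 8-bit frame buffers and a full-frame pixel set S for illustration.

#include <cstdint>

// Sketch of the SVDC of Equation (6.1): SSD of the distorted and original
// synthesized views against the reference view over the pixel set S
// (here simplified to a full W x H frame).
int64_t svdc(const uint8_t *sT_dist, const uint8_t *sT_org,
             const uint8_t *sT_ref, int W, int H) {
    int64_t ssd_dist = 0, ssd_org = 0;
    for (int i = 0; i < W * H; i++) {
        int d1 = sT_dist[i] - sT_ref[i];
        int d2 = sT_org[i]  - sT_ref[i];
        ssd_dist += (int64_t)d1 * d1;   // contributes to D_distCB
        ssd_org  += (int64_t)d2 * d2;   // contributes to D_orgCB
    }
    return ssd_dist - ssd_org;          // SVDC = D_distCB - D_orgCB
}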
The TRenSingleModelC class implements the renderer model for View Synthesis Optimization, one of the most computationally complex parts of the 3D-HEVC encoder. The complexity of this class comes from the rendering functionalities needed for the SVDC computation of each coded depth block. As shown in Fig. 6.2, the renderer model consists of three main processing units. A detailed algorithmic description of the renderer model is given in [114]. A functional description and a hardware complexity analysis of each part of the renderer are given in the following paragraphs.
Initializer
The high-level hardware architecture of the renderer model is shown in Fig. 6.3. Buffers are used for temporary data storage between the external memory and the processing elements. The initializer synthesizes the reference view $S_{T,Ref_{org}}$ from the original input textures $S_{T,l}$, $S_{T,r}$ and depth maps $S_{D,l}$, $S_{D,r}$ stored in the texture (T) and depth (D) map memories, respectively, as shown in Fig. 6.4. Initialization is performed once for every frame. The depth maps are stored as renderer model depth states. The reference view and intermediate variables are stored in the ST (synthesized texture) memory for faster re-rendering. At initialization time, the up-sampling of the input textures and depth maps is carried out, to be used later in the interpolation step. The up-sampled textures $S_{ups(T,l)}$, $S_{ups(T,r)}$ and depth maps $S_{ups(D,l)}$, $S_{ups(D,r)}$ are stored in the UpS (up-sampled) memory.
Partial Re-renderer
The partial re-rendering of the synthesized view needs to be performed when the coding of a depth block is finished. The reconstructed depth block is given to the renderer model, which updates the depth maps stored as renderer model depth states in the initialization step. Instead of rendering the whole views $S_{T,distCB}$ and $S_{T,orgCB}$, which is computationally too complex, a partial re-rendering algorithm is applied that re-renders only the parts of the synthesized view affected by the depth block coding, as shown in Fig. 6.5 for right-to-left view rendering; the same steps hold for left-to-right view rendering. The numbers 1-7 are the sample positions in the reference and synthesized views and $S_{Disp,r}$ are the disparity values. $D1_{distCB}$-$D7_{distCB}$ are the distortions due to samples 1-7, respectively. The re-renderer algorithm involves several steps, i.e. warping, interpolation, occlusion, dis-occlusion and blending; the hardware diagram of the re-renderer is shown in Fig. 6.6. In warping, the depth maps are warped to the synthesis position by the disparity value calculated as
$$dv = (s \cdot v + o) \gg n \qquad (6.2)$$
where v, s, o and n represent the depth sample value, the transmitted scale factor, the transmitted offset vector and the shifting parameter, respectively. The depth blocks are stored in the depth memory DM. The warping itself is a very complex process, requiring the calculation of the depth values z from the depth maps through the camera parameters.
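Per depth sample, Equation (6.2) itself reduces to one multiplication, one addition and one shift, as the minimal sketch below shows (parameter names follow the equation):

// Disparity from a depth sample as in Equation (6.2): dv = (s*v + o) >> n.
inline int disparity(int v, int s, int o, int n) {
    return (s * v + o) >> n;
}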
Warping may cause an original image pixel to be mapped to a fractional pixel position, so rounding to a sub-pixel position, i.e. half-pel or quarter-pel, is required, depending on the quality requirements of the synthesized view. After warping, one of three functions is carried out: interpolation, occlusion or dis-occlusion handling. The interpolation position is computed as
$$x_b = 4\left(\frac{x_{FP} - x_s}{x_e - x_s}\right) + x_s \qquad (6.3)$$
where $x_{FP}$ is the integer interpolation location between the sample interval boundaries $x_s$ and $x_e$ of the intermediate synthesized view, and $x_b$ gives the position of the sample value in the up-sampled input textures $S_{ups(D,l)}$ or $S_{ups(D,r)}$. Occlusions and dis-occlusions are handled by z-buffering and hole-filling algorithms. The filled positions of dis-occluded regions are stored as filling maps $S_{F,l}$, $S_{F,r}$.
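Under the same notation, Equation (6.3) can be sketched as follows; the integer rounding behaviour is an assumption, since the text does not specify it.

// Interpolation position of Equation (6.3): maps the integer location x_FP
// inside the interval [x_s, x_e] of the intermediate synthesized view onto
// the 4x up-sampled input texture.
inline int interp_position(int xFP, int xs, int xe) {
    return 4 * (xFP - xs) / (xe - xs) + xs;   // truncating division assumed
}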
The blending of the synthesized left and right views follows the blending procedure of the view synthesis reference software. The re-rendering process has a high memory cost, in terms of z-buffers for occlusion handling and re-order buffers for warping.
SVDC Calculator
The hardware diagram of the SVDC calculator is shown in Fig. 6.7. The SVDC is calculated from the SSD between $S_{T,Ref_{org}}$ and each of $S_{T,orgCB}$ and $S_{T,distCB}$.
The view synthesized from the original texture and depth maps is used as the reference view for the SVDC calculation. Rate-distortion optimization (RDO) for depth-map coding, i.e. the decision whether or not to use the applied depth coding mode, is carried out based on this synthesized view distortion change value.
References
[1] Russell W Burns. John Logie Baird: Television Pioneer. Number 28. IET, 2000.
[2] Randy D Nash and Wai C Wong. Simultaneous transmission of speech and data over an analog channel, April 16 1985. US Patent 4,512,013.
[3] Sara A Bly, Steve R Harrison, and Susan Irwin. Media spaces: bringing people together in a video, audio, and computing environment. Communications of the ACM, 36(1):28–46, 1993.
[4] Didier Le Gall. MPEG: A video compression standard for multimedia applications. Communications of the ACM, 34(4):46–58, 1991.
[5] Gary J Sullivan and Jens-Rainer Ohm. Recent developments in standardization of High Efficiency Video Coding (HEVC). In SPIE Optical Engineering + Applications, page 77980V. International Society for Optics and Photonics, 2010.
[6] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
[7] T Koga. Motion-compensated interframe coding for video conferencing. In Proc. NTC '81, pages C9.6.1–C9.6.5, 1981.
[8] Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K Jha. Application-specific heterogeneous multiprocessor synthesis using extensible processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(9):1589–1602, 2006.
[9] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.
[10] Ercan Kalali and Ilker Hamzaoglu. A low energy sub-pixel interpolation hardware. In Image Processing (ICIP), 2014 IEEE International Conference on, pages 1218–1222. IEEE, 2014.
[23] Benjamin Bross, Woo-Jin Han, Jens-Rainer Ohm, Gary J Sullivan, Ye-Kui Wang, and Thomas Wiegand. High Efficiency Video Coding (HEVC) text specification draft 10. JCTVC-L1003, 1, 2013.
[24] Joint Video Team. Advanced video coding for generic audiovisual services. ITU-T Rec. H.264 and ISO/IEC 14496-10, 2003.
[25] MPEG-4 Committee et al. Generic coding of moving pictures and associated audio information: Video. ISO/IEC, 2000.
[26] Barry G Haskell, Atul Puri, and Arun N Netravali. Digital video: an introduction to MPEG-2. Springer Science & Business Media, 1996.
[27] Karel Rijkse. H.263: Video coding for low-bit-rate communication. IEEE Communications Magazine, 34(12):42–45, 1996.
[28] Mislav Grgic, Branka Zovko-Cihlar, and Sonja Bauer. Coding of audio-visual objects. In 39th International Symposium Electronics in Marine (ELMAR '97), 1997.
[29] Thierry Turletti. H.261 software codec for videoconferencing over the Internet. PhD thesis, INRIA, 1993.
[30] Hanan Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187–260, 1984.
[31] D Flynn, M Naccari, K Sharman, C Rosewarne, J Sole, GJ Sullivan, and T Suzuki. HEVC range extensions draft 6. Joint Collaborative Team on Video Coding (JCT-VC) JCTVC-P1005, pages 9–17, 2014.
[32] J Chen, J Boyce, Y Ye, and MM Hannuksela. Scalable high efficiency video coding draft 3. Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-N1008, 2014.
[33] Ajay Luthra, Jens-Rainer Ohm, and Jörn Ostermann. Requirements of the scalable enhancement of HEVC. ISO/IEC JTC, 1, 2012.
[34] Gary Sullivan and Jens-Rainer Ohm. Joint call for proposals on scalable video coding extensions of High Efficiency Video Coding (HEVC). ITU-T Study Group, 16, 2012.
[35] Ying Chen, Ye-Kui Wang, Kemal Ugur, Miska M Hannuksela, Jani Lainema, and Moncef Gabbouj. The emerging MVC standard for 3D video services. EURASIP Journal on Applied Signal Processing, 2009:8, 2009.
[36] Anthony Vetro, Thomas Wiegand, and Gary J Sullivan. Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proceedings of the IEEE, 99(4):626–642, 2011.
[37] G Tech, K Wegner, Y Chen, MM Hannuksela, and J Boyce. MV-HEVC draft text 9, document JCT3V-I1002. Sapporo, Japan, Jul, 2014.
[38] G Tech, K Wegner, Y Chen, and S Yea. 3D-HEVC draft text 7, document JCT3V-K1001. Geneva, Switzerland, Feb, 2015.
[39] Y Chen, G Tech, K Wegner, and S Yea. Test model 11 of 3D-HEVC and MV-HEVC. Document of Joint Collaborative Team on 3D Video Coding Extension Development, JCT3V-K1003, 2015.
[40] H Schwarz, C Bartnik, S Bosse, H Brust, T Hinz, H Lakshman, D Marpe, P Merkle, K Müller, H Rhee, et al. Description of 3D video technology proposal by Fraunhofer HHI (HEVC compatible; configuration A). ISO/IEC JTC, 1, 2011.
[41] Li Zhang, Ying Chen, and Marta Karczewicz. Disparity vector based advanced inter-view prediction in 3D-HEVC. In Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, pages 1632–1635. IEEE, 2013.
[42] Li Zhang, Y Chen, and L Liu. 3D-CE5.h: Merge candidates derivation from disparity vector. ITU-T SG, 16, 2012.
[43] L Zhang, Y Chen, X Li, and M Karczewicz. CE4: Advanced residual prediction for multiview coding. In Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V) document JCT3V-D0117, 4th Meeting: Incheon, KR, pages 20–26, 2013.
[44] H Liu, J Jung, J Sung, J Jia, and S Yea. 3D-CE2.h: Results of illumination compensation for inter-view prediction. ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11 JCT3V-B0045, 2012.
[45] Karsten Müller, Philipp Merkle, Gerhard Tech, and Thomas Wiegand. 3D video coding with depth modeling modes and view synthesis optimization. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1–4. IEEE, 2012.
[46] Fabian Jäger. 3D-CE6.h results on simplified depth coding with an optional depth lookup table. JCT3V-B0036, Shanghai, China, 2012.
[47] Jin Heo, E Son, and S Yea. 3D-CE6.h: Region boundary chain coding for depth-map. In Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V) Document JCT3V-A0070, 1st Meeting: Stockholm, Sweden, pages 16–20, 2012.
[48] YW Chen, JL Lin, YW Huang, and S Lei. 3D-CE3.h results on removal of parsing dependency and picture buffers for motion parameter inheritance. Joint Collaborative Team on 3D Video Coding Extension Development of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCT3V-C0137, 2013.
[49] Sehoon Yea and Anthony Vetro. View synthesis prediction for multiview video coding. Signal Processing: Image Communication, 24(1):89–100, 2009.
[50] Christoph Fehn, Eddie Cooke, Oliver Schreer, and Peter Kauff. 3D analysis and image-based rendering for immersive TV applications. Signal Processing: Image Communication, 17(9):705–715, 2002.
[51] Gustavo Sanchez, Mário Saldanha, Gabriel Balota, Bruno Zatt, Marcelo Porto, and Luciano Agostini. DMMFast: a complexity reduction scheme for three-dimensional high-efficiency video coding intraframe depth map coding. Journal of Electronic Imaging, 24(2):023011, 2015.
[52] Pin-Chen Kuo, Kuan-Hsing Lu, Yun-Ning Hsu, Bin-Da Liu, and Jar-Ferr Yang. Fast three-dimensional video coding encoding algorithms based on edge information of depth map. IET Image Processing, 9(7):587–595, 2015.
[53] Heiko Schwarz and Thomas Wiegand. Inter-view prediction of motion data in multiview video coding. In Picture Coding Symposium (PCS), 2012, pages 101–104. IEEE, 2012.
[54] Xiang Li, Li Zhang, and C Ying. Advanced residual prediction in 3D-HEVC. In 2013 IEEE International Conference on Image Processing, pages 1747–1751. IEEE, 2013.
[55] Philipp Merkle, Christian Bartnik, Karsten Müller, Detlev Marpe, and Thomas Wiegand. 3D video: Depth coding based on inter-component prediction of block partitions. In Picture Coding Symposium (PCS), 2012, pages 149–152. IEEE, 2012.
[56] Martin Winken, Heiko Schwarz, and Thomas Wiegand. Motion vector inheritance for high efficiency 3D video plus depth coding. In Picture Coding Symposium (PCS), 2012, pages 53–56. IEEE, 2012.
[57] Gerhard Tech, Heiko Schwarz, Karsten Müller, and Thomas Wiegand. 3D video coding using the synthesized view distortion change. In Picture Coding Symposium (PCS), 2012, pages 25–28. IEEE, 2012.
[58] Karsten Müller and Anthony Vetro. Common test conditions of 3DV core experiments. In JCT3V meeting, JCT3V-G1100, 2014.
[59] Frank Bossen, Benjamin Bross, Karsten Suhring, and David Flynn. HEVC complexity and implementation analysis. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1685–1696, 2012.
[60] Razvan Nane, Vlad-Mihai Sima, Bryan Olivier, Roel Meeuws, Yana Yankova, and Koen Bertels. DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 619–622. IEEE, 2012.
[61] Christian Pilato and Fabrizio Ferrandi. Bambu: A modular framework for the high level synthesis of memory-intensive applications. In Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pages 1–4. IEEE, 2013.
[62] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 33–36. ACM, 2011.
[63] Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao, Stephen Brown, Fabrizio Ferrandi, et al. A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(10):1591–1604, 2016.
[64] Kazutoshi Wakabayashi and Takumi Okamoto. C-based SoC design flow and EDA tools: An ASIC and system vendor perspective. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12):1507–1522, 2000.
[65] Wim Meeus, Kristof Van Beeck, Toon Goedemé, Jan Meel, and Dirk Stroobandt. An overview of today's high-level synthesis tools. Design Automation for Embedded Systems, 16(3):31–51, 2012.
[66] Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matthew Moe, and R Reed Taylor. PipeRench: A reconfigurable architecture and compiler. Computer, 33(4):70–77, 2000.
[67] Nikolaos Kavvadias and Kostas Masselos. Automated synthesis of FSMD-based accelerators for hardware compilation. In Application-Specific Systems, Architectures and Processors (ASAP), 2012 IEEE 23rd International Conference on, pages 157–160. IEEE, 2012.
[68] Nikhil Subramanian. A C-to-FPGA solution for accelerating tomographic reconstruction. PhD thesis, University of Washington, 2009.
[69] Mentor Graphics. DK Design Suite: Handel-C to FPGA for algorithm design, 2010.
[70] Walid A Najjar, Wim Bohm, Bruce A Draper, Jeff Hammes, Robert Rinker, J Ross Beveridge, Monica Chawathe, and Charles Ross. High-level language abstraction for reconfigurable computing. Computer, 36(8):63–69, 2003.
[71] Timothy J Callahan, John R Hauser, and John Wawrzynek. The Garp architecture and C compiler. Computer, 33(4):62–69, 2000.
[72] Maya B Gokhale and Janice M Stone. NAPA C: Compiling for a hybrid RISC/FPGA architecture. In FPGAs for Custom Computing Machines, 1998. Proceedings. IEEE Symposium on, pages 126–135. IEEE, 1998.
[73] Y Explorations. eXCite: C to RTL behavioral synthesis 4.1(a). Y Explorations (YXI), 2010.
[74] Jason Villarreal, Adrian Park, Walid Najjar, and Robert Halstead. Designing modular hardware accelerators in C with ROCCC 2.0. In Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, pages 127–134. IEEE, 2010.
[75] Calypto Design. Catapult: Product family overview, 2014.
[76] Lars E Thon and Robert W Brodersen. C-to-silicon compilation. In Proc. of CICC, pages 117, 1992.
[77] Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alexandru Nicolau. SPARK: A high-level synthesis framework for applying parallelizing compiler transformations. In VLSI Design, 2003. Proceedings. 16th International Conference on, pages 461–466. IEEE, 2003.
[78] Hui-zheng Zhang and Peng Zhang. Integrated design of electronic product based on Altium Designer. Radio Communications Technology, 6:019, 2008.
[79] Ivan Augé, Frédéric Pétrot, François Donnet, and Pascal Gomez. Platform-based design from parallel C specifications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(12):1811–1826, 2005.
[80] Justin L Tripp, Maya B Gokhale, and Kristopher D Peterson. Trident: From high-level language to hardware circuitry. Computer, 40(3), 2007.
[81] Ravikesh Chandra. Novel Approaches to Automatic Hardware Acceleration of High-Level Software. PhD thesis, ResearchSpace@Auckland, 2013.
[82] Erdal Oruklu, Richard Hanley, Semih Aslan, Christophe Desmouliers, Fernando M Vallina, and Jafar Saniie. System-on-chip design using high-level synthesis tools. Circuits and Systems, 3(01):1, 2012.
[83] P Banerjee, N Shenoy, A Choudhary, S Hauck, C Bachmann, M Chang, M Haldar, P Joisha, A Jones, A Kanhare, et al. MATCH: A MATLAB compiler for configurable computing systems. IEEE Computer Magazine, 1999.
[84] Andrew Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, Prasanna Sundararajan, and Susan Eggers. CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures. In Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, pages 173–178. IEEE, 2008.
[85] Kiran Bondalapati, Pedro Diniz, Phillip Duncan, John Granacki, Mary Hall, Rajeev Jain, and Heidi Ziegler. DEFACTO: A design environment for adaptive computing technology. In International Parallel Processing Symposium, pages 570–578. Springer, 1999.
[86] Oliver Pell, Oskar Mencer, Kuen Hung Tsoi, and Wayne Luk. Maximum performance computing with dataflow engines. In High-Performance Computing Using FPGAs, pages 747–774. Springer, 2013.
[87] Satnam Singh and David J Greaves. Kiwi: Synthesis of FPGA circuits from parallel programs. In Field-Programmable Custom Computing Machines, 2008. FCCM '08. 16th International Symposium on, pages 3–12. IEEE, 2008.
[88] Justin L Tripp, Preston A Jackson, and Brad L Hutchings. Sea Cucumber: A synthesizing compiler for FPGAs. In International Conference on Field Programmable Logic and Applications, pages 875–885. Springer, 2002.
[89] CK Cheng and Liang-Jih Chao. Method and apparatus for clock tree solution synthesis based on design constraints, April 2 2002. US Patent 6,367,060.
[90] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4):473–491, 2011.
[91] Tom Feist. Vivado Design Suite. White Paper, 5, 2012.
[92] Leon Stok. Data path synthesis. Integration, the VLSI Journal, 18(1):1–71, 1994.
[93] Christian Pilato, Paolo Mantovani, Giuseppe Di Guglielmo, and Luca P Carloni. System-level memory optimization for high-level synthesis of component-based SoCs. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2014 International Conference on, pages 1–10. IEEE, 2014.
[94] B Ramakrishna Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63–74. ACM, 1994.
[95] Florent De Dinechin. Multiplication by rational constants. IEEE Transactions on Circuits and Systems II: Express Briefs, 59(2):98–102, 2012.
[96] Martin Kumm, Martin Hardieck, Jens Willkomm, Peter Zipf, and Uwe Meyer-Baese. Multiple constant multiplication with ternary adders. In Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pages 1–8. IEEE, 2013.
[97] Ganesh Lakshminarayana, Anand Raghunathan, and Niraj K Jha. Incorporating speculative execution into scheduling of control-flow intensive behavioral descriptions. In Proceedings of the 35th Annual Design Automation Conference, pages 108–113. ACM, 1998.
[98] Hongbin Zheng, Qingrui Liu, Junyi Li, Dihu Chen, and Zixin Wang. A gradual scheduling framework for problem size reduction and cross basic block parallelism exploitation in high-level synthesis. In Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific, pages 780–786. IEEE, 2013.
[110] Cláudio Machado Diniz, Muhammad Shafique, Sergio Bampi, and Jörg Henkel. High-throughput interpolation hardware architecture with coarse-grained reconfigurable datapaths for HEVC. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 2091–2095. IEEE, 2013.
[111] Grzegorz Pastuszak and Maciej Trochimiuk. Architecture design and efficiency evaluation for the high-throughput interpolation in the HEVC encoder. In Digital System Design (DSD), 2013 Euromicro Conference on, pages 423–428. IEEE, 2013.
[112] Sotirios Xydis, Gianluca Palermo, Vittorio Zaccaria, and Cristina Silvano. SPIRIT: Spectral-aware Pareto iterative refinement optimization for supervised high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(1):155–159, 2015.
[113] Cadence Design Systems Inc. Silicon realization enables next-generation IC design. Cadence EDA360 White Paper, 2010.
[114] Y. Chen, G. Tech, K. Wegner, and S. Yea. Test model 11 of 3D-HEVC and MV-HEVC. In JCT3V meeting, JCT3V-K1003, 2015.