
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMM.2016.2532606, IEEE Transactions on Multimedia

A Combined Deblocking Filter and SAO Hardware Architecture for HEVC

Weiwei Shen, Yibo Fan*, Yufeng Bai, Leilei Huang, Qing Shang, Cong Liu and Xiaoyang Zeng

Abstract—The latest video coding standard, High Efficiency Video Coding (HEVC), provides a 50% improvement in coding efficiency over H.264/AVC to meet the rising demands for video streaming, better video quality and higher resolution. The deblocking filter (DF) and Sample Adaptive Offset (SAO) play an important role in the HEVC encoder, and SAO is newly adopted in HEVC. Due to the high throughput requirement of a video encoder, design challenges such as data dependency, external memory traffic and on-chip memory area become even more critical. To solve these problems, we first propose an interlacing memory organization on a quarter-LCU basis to resolve the data dependency between the vertical and horizontal filtering of the deblocking filter. The on-chip SRAM area is also reduced to about 25% by the quarter-LCU scheme, without throughput loss. We also propose a simplified bitrate estimation method for the rate-distortion cost calculation to reduce the computational complexity of the SAO mode decision. Our proposed hardware architecture of the combined deblocking filter and SAO is designed for an HEVC intra encoder, and the proposed simplified bitrate estimation method of SAO can be applied to both intra and inter coding. As a result, our design can support Ultra-High-Definition (UHD) 7680x4320 at 40fps applications at a working frequency of merely 182MHz. The total logic gate count is 103.3K in a 65nm CMOS process.

Index Terms—HEVC, deblocking filter, sample adaptive offset, hardware implementation, UHD

I. INTRODUCTION

ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG) formed the Joint Collaborative Team on Video Coding (JCT-VC) in 2010, and the next-generation coding standard, High Efficiency Video Coding (HEVC), is now being developed. HEVC aims to reduce the bit rate by 50% in comparison with the existing H.264/AVC high profile under the same visual quality [1].

Like the previous video coding standard H.264/AVC, HEVC adopts the block-based hybrid coding framework [2], [3]. A quadtree-based coding structure is an important feature of the HEVC standard. Several new coding structures have been introduced in HEVC: coding unit (CU), prediction unit (PU) and transform unit (TU). CU is the basic unit of region splitting used for intra/inter coding [4]. It can be split from the largest coding unit (LCU, as large as 64x64 pixels) down to the smallest coding unit (SCU, 8x8 pixels). Coupled with CU, PU carries the information related to the prediction processes. TU is used for transform and quantization, and it depends on the PU partitioning modes.

The blocking effect is known as one of the most visible and objectionable artifacts of block-based compression methods [5]. Artifacts commonly seen in prior video coding standards at medium and low bitrates, such as blocking artifacts, ringing artifacts, color biases and blurring artifacts [6], may still exist in HEVC. Hence HEVC adopts in-loop filters to reduce these artifacts. HEVC defines two in-loop filters, shown in Fig. 1. In addition to the DF, similar to the one in H.264/AVC, HEVC further introduces a completely new tool: sample adaptive offset (SAO). The DF leads to a 1.3-3.3% BD-rate reduction [7] on average, and SAO achieves a 3.5% BD-rate reduction [8] at the same quality. The BD-rate calculation is introduced in [15].

Fig. 1. Block diagram of the HM9.0 encoder (intra prediction and ME/MC, transform T and quantization Q, entropy coding to the bitstream, inverse quantization IQ and inverse transform IT, reconstruction, and the in-loop filters — deblocking filter followed by SAO — before the reference picture buffer)

The coding efficiency of HEVC comes at a cost: its computational complexity is very high. From an encoder perspective, an encoder fully exploiting the capabilities of HEVC is expected to be several times more complex than an H.264/AVC encoder. To understand the computational complexity of HEVC, a study mapped the HEVC codec onto existing systems. In [9], the authors mapped the HM (HEVC Test Model) encoder onto a cluster of Xeon-based servers (E5670 clocked at 2.93GHz) using gcc 4.4.5. Even for the intra-only case, the encoding time exceeds 1000 times real-time (the sequences' resolution is 832x480@30fps).

Manuscript received xxx, 2014; revised xxx, 2015; accepted xxx, 2016. Date of publication xxx, 2016. This work was supported in part by the National Natural Science Foundation of China under Grant 61306023.
The authors are with the State Key Lab of ASIC and System, Fudan University, Shanghai, China. Email: {10110720024, fanyibo, 10300720005, 12212020001, 11212020039, 11212020033, xyzeng}@fudan.edu.cn
Corresponding author: Yibo Fan. Email: [email protected]

1520-9210 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Due to the increasing diversity of services, the growing popularity of HD video and the emergence of beyond-HD formats (e.g. 4Kx2K or 8Kx4K resolution) are creating even stronger needs for hardware throughput well beyond H.264/AVC. Hence, hardware realization of the HEVC standard for real-time applications is an essential and challenging task.

There are some previous works on the topic of in-loop filters. In [18], a five-stage pipeline and a hybrid edge filtering sequence are applied; in [19], a five-stage pipelined and resource-shared dual-edge filter generating two filtering results every cycle is proposed; in [23], a parallelized scheme processing the luminance and chrominance samples simultaneously is proposed. However, none of these works supports the HEVC in-loop filters. In [10], an HEVC in-loop filter architecture composed of fully utilized DF and SAO is proposed, but it does not support the HEVC encoder. In [24], a hybrid pipeline with two processing levels is proposed for the HEVC DF, which uses one 1-D filter and single-port on-chip SRAM. In [25], two parallel datapaths are used in the design of the HEVC DF in order to increase its performance. Here, a combined DF and SAO hardware architecture for the HEVC encoder is proposed.

Fig. 2. Overall processing flow of deblocking filter (boundary decision, BS calculation, β/tc/d decision, filter on/off decision, then strong/weak/no filter)

Fig. 3. Two adjacent 4x4 blocks (P block samples p3,y..p0,y and Q block samples q0,y..q3,y across four lines, with the boundary involved in filtering between p0 and q0)

TABLE I
DEFINITION OF BS VALUES

  Conditions                                                      BS value
  At least one of the P or Q blocks is intra                          2
  At least one of the P or Q blocks has a non-zero coded
  residual coefficient and the boundary is a TU boundary              1
  Absolute differences between the MVs of the P and Q
  blocks are >= 1 in units of integer pixels                          1
  Motion-compensated prediction for the P and Q blocks
  refers to different pictures, or the number of MVs is
  different for the two blocks                                        1
  Otherwise                                                           0

A. Motivation

Among the many algorithms adopted by HEVC, the two in-loop filters (DF and SAO) require significant CPU time. According to [9], the DF and SAO cost about one fifth of the CPU time of the whole HEVC codec, and they are two of its complicated modules.

Moreover, beyond-HD applications are becoming more and more popular, at the expense of large external memory traffic. For example, the DF and SAO demand 0.93Gbps and 0.58Gbps of I/O bandwidth respectively for a 2560x1600@30fps application, without any data-reuse scheme. Thus a high-throughput hardware architecture for the DF and SAO becomes a critical issue in HEVC codec design.

B. Design Approach

Due to the increasing demand for high-resolution applications, the heavy data traffic between on-chip memory and external memory becomes even more critical. Considering the tradeoff between on-chip memory area and external memory traffic, we present an interlaced pipeline that combines the DF with SAO on a quarter-LCU basis; a quarter-LCU is defined as a 32x32-pixel block. In the DF process, a novel filter order is suggested in order to keep the same result on a picture basis with the quarter-LCU structure. Meanwhile, we also propose an interlacing memory scheme to arrange the data in on-chip memory and to access the data efficiently in both the vertical and horizontal filtering of the DF phase.

In the SAO statistics collection process, the overall number of comparators is reduced by 83% with our proposed configurable comparator array. We also present a fragmentation adder scheme to balance the computational burden between the pipeline stages of SAO. Meanwhile, a simplified bitrate estimation method for the rate-distortion cost calculation is adopted to reduce the computational complexity of the SAO mode decision.

With these design approaches, an LCU can be filtered in 558 cycles, and our design can support UHD (7680x4320) at 40fps applications at a working frequency of merely 182MHz.

C. Organization of the Paper

The rest of this paper is organized as follows. Section II introduces the DF and SAO algorithms in HEVC. In Section III, our proposed hardware architecture is described. Section IV formulates the simplified bitrate estimation method for the rate-distortion cost calculation in the SAO mode decision. The hardware implementation results and comparisons are provided in Section V, followed by a conclusion in Section VI.
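The boundary-strength rules of Table I can be read as a priority check from top to bottom, returning the first matching BS value. A minimal Python sketch (the function and flag names are illustrative, not from the paper):

```python
def boundary_strength(p_intra, q_intra, nonzero_coeff_on_tu_boundary,
                      mv_diff_ge_one_pel, diff_ref_or_mv_count):
    """Boundary strength (BS) per Table I, checked from top to bottom."""
    if p_intra or q_intra:
        return 2  # at least one of the P or Q blocks is intra
    if nonzero_coeff_on_tu_boundary:
        return 1  # non-zero residual coefficient on a TU boundary
    if mv_diff_ge_one_pel:
        return 1  # MV difference >= 1 in units of integer pixels
    if diff_ref_or_mv_count:
        return 1  # different reference pictures or number of MVs
    return 0      # otherwise: the boundary is not filtered
```

Only a BS greater than zero triggers the further β/tc-based decisions.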


Fig. 4. Four 1-D directional patterns for EO classification: EO_0 (horizontal), EO_1 (vertical), EO_2 (135° diagonal) and EO_3 (45° diagonal); "c" is the current sample and "a", "b" its two neighbors.

Fig. 5. Four bands are grouped together and signaled from the starting band position (four bands' offsets between the minimum and maximum pixel value).

TABLE II
SAMPLE CATEGORIES IN EDGE OFFSET

  Category   Condition
  1          c < a && c < b
  2          ( c < a && c == b ) || ( c == a && c < b )
  3          ( c > a && c == b ) || ( c == a && c > b )
  4          c > a && c > b
  0          None of the above

Fig. 6. SAO merging mode (the current LCU can merge with the left or upper LCU; the decision is shared by Y, Cb and Cr).

II. THE DEBLOCKING FILTER AND SAO ALGORITHMS IN HEVC

In HEVC, two in-loop filters, the DF followed by SAO, are applied to the reconstructed samples. The DF is intended to reduce the artifacts caused by block-based coding. While the DF is only applied to samples located at PU or TU boundaries, SAO is applied adaptively to all samples satisfying certain conditions (e.g. based on the gradient) [11], in order to reduce the mean sample distortion of a region.

A. The Deblocking Filter Algorithm

The DF is applied to all samples adjacent to a PU or TU boundary, except at the picture boundary. Unlike H.264/AVC, where the DF is applied on a 4x4 sample grid, HEVC only applies the DF to the PU or TU boundaries that are also aligned on an 8x8 sample grid.

Fig. 2 illustrates the overall processing flow of the DF [3]. First, a decision is made whether the current boundary is a PU or TU boundary. If not, the filter is not applied to the current boundary. The P block and Q block are two adjacent 4x4 blocks sharing the boundary involved in filtering, shown in Fig. 3. The boundary strength (BS) reflects how strong a filter is needed for the boundary, and is controlled by several syntax elements. The BS value can take one of three possible values: 0, 1 and 2. The definition of the BS value is shown in Table I. If the BS value is greater than zero, additional conditions are checked in order to determine whether the DF should be applied to the block boundary, and to select between the strong and weak filter.

According to the BS value and the quantization parameter (QP) of the P and Q blocks, two thresholds, β and tc, are determined from pre-defined tables. Together with β and tc, the value d is used for the filter on/off decision and the strong/weak filter selection; it is derived from the values of twelve pixels in the first and the fourth lines. These twelve pixels are labeled with red circles in Fig. 3. Note that this decision is shared across four lines (four rows for vertical boundary filtering, four columns for horizontal boundary filtering). Meanwhile, each line has its own respective values, such as dE, dEp and dEq, which are used for the filter on/off and strong/weak filter decisions of that line.

dE ≠ 0, BS ≠ 0    (1)
BS > 1            (2)

For example, the values of the eight pixels across the boundary are denoted p3,0, p2,0, p1,0, p0,0, q0,0, q1,0, q2,0, q3,0, labeled as the first line in Fig. 3. For the luminance component, the first line is filtered only when condition (1) is satisfied. For the chrominance components, the first line is filtered only when condition (2) is satisfied. The strong and weak filter decisions are detailed in [3].

B. The SAO Algorithm

SAO is located after the DF and also belongs to the in-loop filters; it modifies the samples after the DF. The concept of SAO is to classify the reconstructed pixels into different categories, obtain an offset for each category, and add the offset to each sample of the category [7]. It is performed on an LCU basis in HM 9.0 (the reference software for HEVC).

Two SAO types are adopted in HEVC: Edge Offset (EO) and Band Offset (BO). For EO, the sample classification is based on the comparison between the current sample and its neighboring pixels. For BO, the sample classification is based on the sample values [7]. Note that each color component (Y, Cb and Cr) has its own SAO parameters. To reduce the information to be coded by entropy coding, the current LCU can reuse the SAO parameters of the left LCU or upper LCU through the SAO merging mode.

Edge Offset: EO uses four 1-D directional patterns for pixel classification: the EO_0 class (horizontal), EO_1 class (vertical), EO_2 class (135° diagonal) and EO_3 class (45° diagonal), shown in Fig. 4. The sample labeled "c" is the current sample under consideration, and the two samples labeled "a" and "b" are its two neighboring pixels. According to these patterns, four EO classes are specified, each corresponding to one pattern. Please note that only one EO class is selected for each LCU that enables EO.
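The classification of Table II fits in a few lines of Python, with a, c, b taken along the selected EO pattern (a minimal sketch; the function name is illustrative):

```python
def eo_category(a, c, b):
    """Edge Offset category of current sample c with neighbors a and b
    along the chosen 1-D pattern (Table II)."""
    if c < a and c < b:
        return 1  # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2  # edge bending down toward c
    if (c > a and c == b) or (c == a and c > b):
        return 3  # edge bending up toward c
    if c > a and c > b:
        return 4  # local maximum
    return 0      # none of the above: no offset is applied
```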


Fig. 7. Proposed hardware architecture of combined deblocking filter and SAO (a data controller between external line buffers of post-DF and original pixels, the deblocking filter phase with four-line filter core and BS/β/tc calculation, the SAO statistics collection phase with difference statistics SRAMs, fragmentation adders and a configurable comparator array, and the SAO mode decision phase with fast distortion estimation producing the SAO side information to be coded)

Fig. 8. Proposed z-scan order of quarter-LCU at the picture level (from quarter-LCU_0 to quarter-LCU_3 within an LCU)

For a given EO class, each sample is classified into one of five categories by comparing its own value with the values of its two neighboring samples, as shown in Table II. If the current sample does not belong to any of categories 1-4, SAO is not applied to it.

Band Offset: BO performs the sample classification based on the sample's own value. In HM 9.0, BO classifies all pixels of an LCU into multiple bands, where each band contains the pixels in the same intensity interval. The concept of a band in BO is similar to that of a category in EO. The pixel intensity range is equally divided into 32 uniform bands from zero to the maximum value (e.g. 255 for 8-bit pixels), and each band has its own offset, shown in Fig. 5.

Only the four offsets of four consecutive bands and the starting band position are coded by entropy coding and transmitted to the decoder. There are 29 possible choices of starting band. The reason for selecting only four bands is that the sample range in a region can be quite limited after the regions are reduced from picture quadtree partitions to CTBs [9].

SAO merging: The current LCU can reuse the SAO parameters (SAO type and four offsets) of the left LCU (merge-left) or upper LCU (merge-up), as shown in Fig. 6. Please note that the SAO information (SAO type and offsets) is shared by the Y, Cb and Cr components. If the current LCU selects merge-left or merge-up, all SAO information of the current LCU is reused from the left or upper LCU. The SAO merging mode effectively reduces the side information to be coded by entropy coding.

Fast distortion estimation: The selection of the SAO type and offsets is performed with a fast distortion estimation method on a region of an LCU in HM 9.0. A direct implementation of SAO would need to add the offsets to the pre-SAO samples (i.e. the post-DF samples) to generate the post-SAO samples, and then calculate the distortion between the original samples and the post-SAO samples; the fast distortion estimation method [8] avoids this as follows. Let k, p(k) and d(k) be the pixel positions, original pixels and pre-SAO pixels, where k belongs to C, and C is the set of pixels that belong to a specified SAO type (EO or BO), a specified category of an EO class, or a starting band in BO. The distortion between the original pixels and the pre-SAO pixels is

D_pre = Σ_{k∈C} (p(k) − d(k))²    (3)

The distortion between the original pixels and the post-SAO pixels is given by the following equation, where O is the offset of the given pixel set:

D_post = Σ_{k∈C} (p(k) − (d(k) + O))²    (4)

The delta distortion is then defined as

ΔD = D_post − D_pre = NO² − 2OE    (5)

E = Σ_{k∈C} (p(k) − d(k))    (6)

In (5), N is the number of pixels of the given set, and E is the sum of the differences between the original pixels and the pre-SAO pixels, as defined in equation (6).
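Equations (3)-(6) say that the distortion change can be obtained from only N, O and E, without ever generating post-SAO samples. A small Python check of this identity (the pixel values are made-up examples, not from the paper):

```python
def delta_distortion_direct(orig, pre, offset):
    """D_post - D_pre computed by brute force, per equations (3) and (4)."""
    d_pre = sum((p - d) ** 2 for p, d in zip(orig, pre))
    d_post = sum((p - (d + offset)) ** 2 for p, d in zip(orig, pre))
    return d_post - d_pre

def delta_distortion_fast(orig, pre, offset):
    """N*O^2 - 2*O*E, per equations (5) and (6); needs only the
    pre-collected statistics N and E."""
    n = len(orig)
    e = sum(p - d for p, d in zip(orig, pre))  # equation (6)
    return n * offset * offset - 2 * offset * e

orig = [52, 60, 57, 49]  # hypothetical original pixels p(k)
pre = [50, 61, 55, 50]   # hypothetical pre-SAO (post-DF) pixels d(k)
for o in (-2, -1, 0, 1, 2):
    assert delta_distortion_fast(orig, pre, o) == delta_distortion_direct(orig, pre, o)
```

A negative ΔD means the offset O reduces the distortion of that pixel set.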


Next, the delta rate-distortion cost is defined as

ΔJ = ΔD + λR    (7)

In (7), λ is the Lagrange multiplier, and R represents the estimated number of bits of side information for a specified SAO type. The estimation of R is performed by entropy coding, which is a time-consuming process in HM 9.0. A simplified bitrate estimation method is proposed to reduce the computational complexity; it is elaborated in Section IV.

III. PROPOSED HARDWARE ARCHITECTURE

Fig. 7 shows the overall proposed hardware architecture of the combined DF and SAO. We adopt an interlaced pipeline architecture to speed up the combined DF and SAO. The whole process is partitioned into three phases: DF, SAO statistics collection and SAO mode decision.

The whole design is based on a quarter-LCU (32x32 pixels for Y, 16x16 pixels for Cb and Cr) scheme, which reduces the area of the on-chip SRAM (which stores the reconstructed and original pixels) by about 75% compared to an LCU basis, without throughput loss. Since the SAO is performed on an LCU basis in HM 9.0, a z-scan order (shown in Fig. 8) is proposed to keep the same results at the picture level (from quarter-LCU_0 to quarter-LCU_3 in an LCU).

In the DF phase, a four-stage pipeline is adopted to enhance the throughput on the basis of a four-line unit, while a two-stage pipeline is adopted in the SAO statistics collection phase. During the DF process, the statistics information for deriving the offsets and the fast distortion estimation of SAO is collected after the horizontal edge filtering. Consequently, an interlaced pipeline scheme for the DF and SAO is proposed to avoid fetching any post-DF pixels from SRAM and to improve the throughput significantly.

A. Analysis of Data-Reuse for the Combined DF and SAO

To clarify the advantage of the combined DF and SAO hardware architecture, an external memory access factor Em is defined to evaluate the efficiency of the external memory access for the DF and SAO:

Em = Emr + Emw
   = (number of memory reading accesses from the bus for a block) / (number of 4x4 blocks of the block)
     + (number of memory writing accesses to the bus for a block) / (number of 4x4 blocks of the block)
   = (total number of memory accesses of a block) / (number of 4x4 blocks of the block)    (8)

where Emr and Emw represent the average number of memory accesses from and to the bus per 4x4 block, respectively. Em stands for the average number of external memory accesses per 4x4 block for the DF and SAO. The smaller Em is, the greater the reduction of the external memory bandwidth.

In a general hardware design, the DF and SAO are based on the LCU level, so the block size in equation (8) is set to the LCU size (64x64 pixels). Because sixteen vertical boundaries and sixteen horizontal boundaries have to be filtered by the DF, the number of 4x4-block memory reading accesses is 867 (=16x17+2x8x9+17x17+2x9x9), and the number of 4x4-block memory writing accesses is 867, without a data-reuse scheme. Moreover, the SAO requires both the post-DF and the original pixels: the number of 4x4-block memory reading accesses is 1048 (=2x(17x17+2x9x9+73)), while the number of 4x4-block memory writing accesses is 0 (the SAO just sends the selected type and offsets to the system), without a data-reuse scheme. Em can then be calculated as

Em = ((867 + 1048) + (867 + 0)) / (16x16 + 8x8 + 8x8) ≈ 7.2    (9)

It is observed that each pixel is accessed about eight times to complete the DF and SAO, which causes a huge overhead on the external memory. To reduce the external memory bandwidth, an on-chip memory scheme is a well-known method to keep the reused data on chip. Here, a novel filter order scheme is proposed to reuse the temporary data after vertical boundary filtering for the subsequent horizontal boundary filtering in the DF phase, cooperating with the proposed on-chip memory organization. Moreover, the combined DF and SAO hardware architecture can reuse the post-DF pixels for the SAO statistics collection phase.

With the above methods, the block size in equation (8) is set to the quarter-LCU size (32x32 pixels). The number of 4x4-block memory reading accesses is 112 (=8x9+2x4x5), and the number of 4x4-block memory writing accesses is 16 (=8+2x4) for the DF. Meanwhile, the SAO only reads the original pixels from the external memory, and its number of 4x4-block memory reading accesses is 112 (=8x9+2x4x5). Em for the above data-reuse methods is

Em = ((112 + 112) + (16 + 0)) / (8x8 + 4x4 + 4x4) = 2.5    (10)

This means that the total number of 4x4-block external memory accesses is dramatically reduced to 35%, which greatly eases the external memory traffic. Moreover, the combined DF and SAO hardware architecture eliminates the SRAMs for storing the post-DF pixels for SAO: the SAO can utilize the post-DF pixels directly.

B. Bandwidth Utilization Discussion

The external memory access factor Em is a theoretical number to evaluate the efficiency of the external memory access. Furthermore, the bandwidth utilization of the external memory is an extremely important issue in a real video encoder design, especially for DRAM.

DRAM, such as SDRAM or DDR1/2/3, is widely used as external memory and prefers block-based access. In other words, if the desired data is tiny and fractionally distributed in the external memory, the pre-fetch mechanism of DDR is wasted, resulting in a very low utilization of the external memory bandwidth. The improvement of the DRAM bandwidth utilization should therefore be a serious concern in a real video encoder design.
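The two operating points of equations (9) and (10) can be reproduced directly from the access counts above (a plain arithmetic sketch; the function name is illustrative):

```python
def em(read_accesses, write_accesses, num_4x4_blocks):
    """External memory access factor Em = Emr + Emw, per equation (8)."""
    return (read_accesses + write_accesses) / num_4x4_blocks

# LCU basis (64x64 Y plus two 32x32 chroma), no data reuse: equation (9)
em_lcu = em(867 + 1048, 867 + 0, 16 * 16 + 8 * 8 + 8 * 8)

# quarter-LCU basis with the proposed data reuse: equation (10)
em_qlcu = em(112 + 112, 16 + 0, 8 * 8 + 4 * 4 + 4 * 4)

print(round(em_lcu, 1), em_qlcu)  # -> 7.2 2.5
```

The ratio em_qlcu / em_lcu ≈ 0.35 matches the 35% figure quoted above.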


Fig. 9. Memory organization of reconstructed pixels (the 4x4 blocks of the current quarter-LCU's 32x32 luma and two 16x16 chroma components are interlaced between SRAM_C1 and SRAM_C2; the left neighboring pixels are held in the V line buffers SRAM_L1/SRAM_L2 and the top neighboring pixels in the H line buffer SRAM_T; the vertical boundaries v1-v8, horizontal boundaries h1-h8 and deferred units P1-P8/N1-N8 are marked)

Fig. 10. Proposed timing diagram of an LCU (the four DF stages interlaced with the two SAO statistics collection stages for quarter_LCU_0 through quarter_LCU_3, including the vertical chroma edges, followed by the SAO mode decision; a complete LCU finishes at cycle 558)

Thus, great efforts should be put into joining fractional data into continuous data chains stored in the external memory.

In our external memory system, we use DDR2 (the data width is 32 bits and the burst length is 4, so one access equals one 4x4 block), and the pixels in DDR2 are ordered quarter-LCU by quarter-LCU as one big chain, in which the pixels are ordered as 4x4 blocks. With our proposed data-reuse scheme, we only access the whole current quarter-LCU and one row of 4x4 blocks from the neighboring top quarter-LCU. The data in the on-chip SRAMs are also stored with the 4x4 block as the basic unit. Hence we always access consecutive integral multiples of 4x4 blocks of data, and there is no overhead in the external memory access.

C. A Novel Filter Order Scheme

In this section, three filter order levels, named the picture level, the boundary level and the four-line unit level, are defined to describe our proposed filter order.

1) Picture Level Filter Order

The vertical edges in a reconstructed picture are filtered first, and then the horizontal edges are filtered, with the samples modified by the DF of the vertical edges as input. The vertical edges are filtered starting with the left-most edge and proceeding towards the right in their geometrical order. The horizontal edges are filtered starting with the top-most edge and proceeding towards the bottom in their geometrical order. This is the basic filter order principle of the DF in HEVC [3].

2) Boundary Level Filter Order

The boundaries involved in filtering on a quarter-LCU basis are labeled as red lines in Fig. 9. For component Y, the vertical boundaries {v1, v2, v3, v4} are filtered from left to right (if any), and the horizontal boundaries {h1, h2, h3, h4} are filtered from top to bottom (if any). The filter order for components Cb and Cr is similar to that of component Y.

3) Four-line Unit Level Filter Order

As shown in Fig. 3, a four-line unit is composed of two adjacent 4x4 blocks, and each boundary consists of several four-line units. The vertical boundaries are filtered from the top four-line unit to the bottom four-line unit (e.g., from four-line unit 1 to four-line unit 8 in vertical boundary v1).
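For the luma component, the combined boundary-level and four-line-unit-level order can be enumerated as below. This is a toy sketch only: it walks v1-v4 and then h1-h4 unit by unit, and it ignores the units at the quarter-LCU edge whose filtering is deferred to a neighboring quarter-LCU (the function name is illustrative):

```python
def luma_filter_order(units_per_boundary=8):
    """Four-line-unit visiting order for a 32x32 luma quarter-LCU:
    vertical boundaries v1..v4 (top to bottom within each boundary),
    then horizontal boundaries h1..h4 (left to right within each)."""
    order = []
    for boundary in ("v1", "v2", "v3", "v4", "h1", "h2", "h3", "h4"):
        for unit in range(1, units_per_boundary + 1):
            order.append((boundary, unit))
    return order
```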


unit to the bottom four-line unit (e.g., from four-line unit 1 to four-line unit 8 in vertical boundary v1). The horizontal boundaries are filtered from the left four-line unit to the right four-line unit (e.g., from four-line unit 9 to four-line unit 15 in horizontal boundary h1).

For our quarter-LCU structure, we adopt a filter order scheme that combines the boundary level filter order with the four-line unit level filter order in a quarter-LCU. Firstly, the vertical boundaries are filtered from v1 to v4, then the horizontal boundaries from h1 to h4 for component Y (the process for Cb and Cr is similar to that for Y), and each boundary obeys the four-line unit level filter order. Note that the four-line units {N1, N2, N3, N4, N5, N6, N7, N8} are not filtered in the current quarter-LCU, which is achieved by explicitly setting the BS value to 0, unless the current quarter-LCU is on the right boundary of the entire picture. These lines are filtered in the quarter-LCU to the right of the current one, where they correspond to {P1, P2, P3, P4, P5, P6, P7, P8}, so as to keep the same result as HM 9.0.

Since the SAO is performed on an LCU basis in HM 9.0, a z-scan order (shown in Fig. 8) is proposed to keep the same results at the picture level (from quarter-LCU_0 to quarter-LCU_3 in an LCU). Once the DF and SAO statistics collection phase is completed for quarter-LCU_0, quarter-LCU_1, quarter-LCU_2 and quarter-LCU_3 (shown in Fig. 10), the SAO mode decision phase for the LCU is invoked.

D. Proposed Memory Organization

An on-chip memory is used to reduce the I/O bandwidth between the chip and the system. The main challenge is to properly arrange the data in the memory modules so that the in-loop filter core can access the data smoothly on both vertical and horizontal boundaries. We present an interlacing memory organization to access the data effectively during both the vertical and the horizontal filtering of the DF.

Our approach is to divide the on-chip memory of reconstructed pixels (used for the DF) into five modules (SRAM_C1, SRAM_C2, SRAM_L1, SRAM_L2, and SRAM_T), shown in Fig. 9. SRAM_C1 and SRAM_C2 store the pixels of the current quarter-LCU; those in SRAM_L1 and SRAM_L2 are from the left neighboring quarter-LCU. Meanwhile, the pixels in SRAM_T come from the top neighboring quarter-LCU and the left-top neighboring quarter-LCU. Each square in Fig. 9 stands for a 4x4 block of pixels of the luminance or chrominance component. For example, the grey squares from C0 to C95 belong to SRAM_C1, while the white squares from C1 to C94 belong to SRAM_C2. With our proposed interlacing memory organization, the two 4x4 pixel blocks on the two sides of every boundary always come from different memory modules. As a result, all pixels can be easily accessed from the SRAMs in both the vertical and the horizontal filtering operations.

A two-port SRAM can support one read and one write in the same clock cycle. In our proposed design, we utilize five two-port SRAMs for the reconstructed pixels. These SRAMs are also used to store the intermediate pixels for horizontal edge filtering after the vertical edges have been filtered. Therefore, no additional on-chip memories are needed for the DF.

Due to the directional patterns of EO in the SAO statistics collection phase, additional V and H line buffers are needed to store some post-DF pixels for the category selection (shown in Fig. 9, where each yellow and green circle stands for a pixel). Another two SRAMs (SRAM_O1, SRAM_O2) and V, H line buffers are also adopted to store the original pixels for SAO statistics collection. The organization of these memories is similar to the memory scheme of the reconstructed pixels in the DF.

With the proposed memory organization based on the quarter-LCU structure, the on-chip memory area is reduced to about 25% compared to an LCU-based design.

E. Pipeline Schedule

Fig. 10 shows the timing diagram of our design. We employ a four-stage pipeline which can filter four lines simultaneously in the DF phase (shown in Fig. 7). The function of each stage, depicted in Fig. 11, is as follows:

Stage 0 (DF_0): Calculate the BS value according to information such as the prediction mode and the motion vector. Reading the pixels to be filtered from the on-chip SRAMs (SRAM_C1, SRAM_C2, SRAM_L1, SRAM_L2, and SRAM_T) is also accomplished in this stage.

Stage 1 (DF_1): Calculate the threshold values β and tc, and the conditions dE, dEp and dEq used for the filter on/off and strong/weak filter decisions.

Stage 2 (DF_2): According to BS, tc, dE, dEp and dEq obtained in the previous stage, different filter taps are applied to the pixels across the current boundary. The detailed filtering processes of both the strong filter and the weak filter can be found in HEVC [3].

Stage 3 (DF_3): Write the filtered data back to the on-chip SRAM to be filtered further, or to the external memory to be referenced by the other modules of the video encoder.

Fig. 11. Four-stage pipeline for DF

The pre-DF pixels are accessed and modified both in the filtering on vertical edges and in the filtering on horizontal edges of the DF phase. The pixels filtered on horizontal edges are then fetched by the SAO statistics collection phase.

As shown in Fig. 12, because EO must refer to the neighboring pixels of a sample to obtain its category, only 32 pixels' information is collected inside the region marked with the solid line. The pixels inside the dash-line region are filtered by the DF. The pixels drawn as yellow circles come from the H line buffer of post-DF pixels, while the pixels drawn as green circles come from the V line buffer.

Fig. 12. Proposed pixels' selection of combined deblocking filter and SAO

Since the information of all these 32 pixels must be collected for EO_0, EO_1, EO_2, EO_3 and BO, an interlaced pipeline scheme, coupled with a configurable two-stage SAO pipeline, is proposed to reduce the chip area of the SAO statistics collection phase to about 25%. As shown in Fig. 10, each pipeline stage of DF and SAO takes one cycle. The pipeline of the DF is invoked at intervals of four cycles in order to wait until all 32 pixels' statistics have been collected by SAO. The function of each stage of the SAO statistics collection phase is as follows (using the notations in Fig. 10):

Stage 0 (SAO_0): Obtain the 32 pixels' category and band selection information for EO_0, EO_1, EO_2, EO_3 and BO, and perform part of the additions of equation (6) at the same time.

Stage 1 (SAO_1): Complete the remaining additions of equation (6).

SAO takes four cycles to collect the statistics information of a four-line unit. For example, EO_0 and BO's bands 0-7 are processed in cycle 35; EO_1 and BO's bands 8-15 are processed in cycle 36; EO_2 and BO's bands 16-23 are processed in cycle 37; EO_3 and BO's bands 24-31 are processed in cycle 38, as shown in Fig. 10. The interlaced pipeline scheme achieves a good tradeoff between high throughput and chip area.

A configurable comparator array and a fragmentation adder scheme are proposed to cooperate with the SAO pipeline. Fig. 13 shows the proposed configurable comparator array employed for category classification in EO. In Fig. 4, "c" is the current pixel, and "a" and "b" are the two neighboring pixels. As shown in Table II, the current pixel "c" is compared to "a" and "b". Data reuse between pixels can further be applied to the next samples' category classification. For example, assuming the SAO type is EO_0, the current pixel's comparison value of "c" and "b" can be derived from its right pixel's comparison value of "c" and "a". In Fig. 13, besides the 32 fixed comparators, 12 configurable comparators are applied for the pixels that cannot reuse the data of a neighboring pixel at the boundary of these 32 pixels. With the proposed configurable comparator array and data reuse, the number of comparators is reduced to 17% (32 fixed comparators + 12 configurable comparators with the proposed method, versus 32 comparators × 2 (one pixel needs two comparators) × 4 (four EO classes) = 256 comparators without it).

Fig. 13. Proposed configurable comparator array

For the Y component, the five right column lines and four bottom lines of an LCU, and for the Cb&Cr components, the three right column lines and two bottom lines, should not be included in the statistics collection, as detailed in [13]. Owing to this irregularity in the statistics collection of a 32x32-pixel region, we summarize eight structures in Fig. 14. The statistics information of the pixels drawn in red in Fig. 14 should be collected in a specified category or band, according to equation (6). Meanwhile, the 32 pixels' differences between their original values and their post-DF values are accumulated for the subsequent fast distortion estimation. If the addition of these 32 pixels' differences were done in one cycle, that cycle would be very time-consuming, which would restrict the highest working frequency of the chip. The proposed fragmentation adder scheme therefore divides these 32 pixels into 12 regions, as shown in Fig. 14. The pixels' differences of each region are added in stage 0 of the SAO pipeline. Then these 12 region sums are added selectively in stage 1 of SAO. The proposed fragmentation adder scheme accomplishes a good tradeoff between high throughput and the maximum working frequency of the chip.

Fig. 14. Proposed fragmentation adder scheme

With the proposed scheme, the DF phase and the SAO statistics collection phase take 518 cycles per LCU, as shown in Fig. 10. The rest of the fast distortion estimation is described in Section IV.

IV. SIMPLIFIED FAST RATE ESTIMATION

In HM9.0, the λ values for Y and Cb&Cr are two floating-point numbers according to equation (7), which are derived from two exponential functions of the quantization parameter (QP), and floating-point multiplication is both time and area consuming in hardware design. In order to simplify the multiplication, two look-up tables are utilized to generate the λ for Y and Cb&Cr respectively, where QP is the integer index ranging from 0 to 51 and the values of λ in the tables are all integers (floating-to-integer method, FTI).

Besides, we also propose a simplified bitrate (R) estimation method (SBE) for the rate-distortion cost calculation of equation (7). In HM9.0, the value of R is derived from entropy coding (coding a specified SAO type and the four offsets of this type). The calculation of R takes at least 8 cycles according to the hardware architectures of CABAC proposed in [12] and [13], which is a time-consuming process. Here, we propose a simplified bitrate estimation method, which employs a linear model to derive the value of R in one cycle.

N randomly generated data $x_1, x_2, \dots, x_N$ are used as the input to the simplified bitrate estimation function, with the corresponding output $y_1, y_2, \dots, y_N$. In our problem, $x_i$ ($i = 1, 2, \dots, N$) is the sum of the absolute values of the offsets of a specified SAO type, and $y_i$ ($i = 1, 2, \dots, N$) is the estimated number of bits of side information of the corresponding SAO type. We assume a linear model on this I/O relationship to reduce the cost of the hardware design and improve the throughput significantly:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} a + b \qquad (11)$$

which can be represented concisely as

$$\mathbf{y} = [\mathbf{x} \;\; \mathbf{1}_{N\times 1}]\,\boldsymbol{\theta} \qquad (12)$$

where $\mathbf{y} = [y_1\; y_2 \dots y_N]^T$, $\mathbf{x} = [x_1\; x_2 \dots x_N]^T$, $\mathbf{1}_{N\times 1}$ is an all-one vector of size $N\times 1$, and $\boldsymbol{\theta} = (a\;\; b)^T$ denotes the parameters of the linear model.

The least squares technique [14] is applied to estimate $\boldsymbol{\theta}$:

$$\boldsymbol{\theta} = \left([\mathbf{x}\;\;\mathbf{1}_{N\times 1}]^H [\mathbf{x}\;\;\mathbf{1}_{N\times 1}]\right)^{-1} [\mathbf{x}\;\;\mathbf{1}_{N\times 1}]^H \mathbf{y} \qquad (13)$$

which can be solved easily as a linear equation in the standard manner. With the above method, we obtain the following equations for R estimation:

$$R_{Y\_EO} = \sum_{i=0}^{3}|offset_i| + 13 \qquad (14)$$

$$R_{Y\_BO} = \sum_{i=0}^{3}|offset_i| + 16 \qquad (15)$$

$$R_{Cb\&Cr\_EO} = \sum_{i=0}^{7}|offset_i| + 17 \qquad (16)$$

$$R_{Cb\&Cr\_BO} = \sum_{i=0}^{7}|offset_i| + 25 \qquad (17)$$

$$R_{Y\&Cb\&Cr} = 3\sum_{i=0}^{11}|offset_i| + 1 \qquad (18)$$

$$R_{merging} = 4 \qquad (19)$$

The performance of the proposed FTI+SBE method is evaluated in terms of the change of the average Bjontegaard Delta rate (BD-rate) [15]. The performance gain or loss is measured with respect to the HEVC reference software platform (HM9.0). The experiments are carried out under the "All Intra-Main (AI)", "Low Delay-Main (LD)" and "Random Access-Main (RA)" settings as stipulated by the common conditions proposed in [16]. The configuration files are provided in the common software package of HM9.0. QP values of 22, 27, 32 and 37 cover a broad range of qualities and bit rates.

Table III shows the experimental results of the proposed FTI+SBE scheme as compared to HM9.0, tested on 20 sequences. It is observed that the BD-rate increment is less than 0.2% for all sequences under the "AI" setting, which is negligible, with a best case of 0.0% increment and a worst case of 0.2% increment.

TABLE III
BD-RATE (%) OF THE PROPOSED FTI+SBE METHOD

Class                 Sequence           AI    LD    RA
Class A [2560x1600]   Traffic            0.1   0.1   0.0
                      PeopleOnStreet     0.2   0.2   0.2
                      Nebuta             0.0   0.2   0.1
                      SteamLocomotive    0.1   1.1   0.5
Class B [1920x1080]   Kimono             0.1   0.2   0.1
                      ParkScene          0.1   0.2   0.2
                      Cactus             0.0   0.2   0.1
                      BasketballDrive    0.1   0.1   0.3
                      BQTerrace          0.0   0.2   0.2
Class C [832x480]     BasketballDrill    0.1   0.5   0.0
                      BQMall             0.0   0.1   0.0
                      PartyScene         0.0   0.0   0.1
                      RaceHorses         0.0   0.1   0.2
Class D [416x240]     BasketballPass     0.1   0.1   0.1
                      BQSquare           0.1   0.3   0.4
                      BlowingBubbles     0.0   0.1   0.0
                      RaceHorses         0.0   0.2   0.0
Class E [1280x720]    FourPeople         0.1   0.3   0.0
                      Johnny             0.1   0.0   0.1
                      KristenAndSara     0.0   0.4   0.2
Average                                  0.1   0.2   0.1

TABLE IV
HARDWARE IMPLEMENTATION RESULT

Purpose                      Gate count (NAND2)
Deblocking filter            30.3K
SAO statistics collection    55.9K
SAO mode decision            17.1K
Total                        103.3K

TABLE V
HARDWARE IMPLEMENTATION RESULT

Technology           TSMC 65nm CMOS
Gate count           103.3K
On-chip SRAM         4.2KB
On-chip buffer       0.6KB
Working frequency    200MHz
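To make the SBE derivation above concrete, the sketch below fits the linear model of eqs. (11)-(13) in closed form and implements the estimator of eq. (14). The (x, y) sample points are illustrative only, not the paper's actual training data.

```python
# Sketch of the SBE derivation: closed-form least squares for y ≈ a*x + b
# (eqs. (11)-(13)), plus the resulting estimator of eq. (14).

def fit_linear(xs, ys):
    """Solve theta = (A^T A)^(-1) A^T y with A = [x 1] (ordinary least squares)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

def r_y_eo(offsets):
    """R_{Y_EO} = sum_{i=0}^{3} |offset_i| + 13, eq. (14)."""
    assert len(offsets) == 4
    return sum(abs(o) for o in offsets) + 13

# Example: data lying exactly on y = x + 13 recovers a = 1, b = 13.
a, b = fit_linear([0, 2, 5, 9], [13, 15, 18, 22])
```

In hardware, only the fitted integer constants survive: each estimator of eqs. (14)-(19) reduces to one sum of absolute offsets plus a constant, which is why R can be derived in a single cycle.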


TABLE VI
COMPARISON WITH PREVIOUS DESIGNS

Design    Standard  Implementation  DF  SAO  Processing time  Process  Gate count         On-chip SRAM  Supporting video format    Frequency  DE
                    style                    (cycles/LCU)              (NAND2)            (byte)                                   (MHz)      (x10^3)
[17]      H.264     Gate-level      √   ×    260*16           0.13um   36.9K              672           16VGA@30fps (2560x1920)    225        4.0
[18]      H.264     Real-chip       √   ×    204*16           0.18um   21.4K              512+64N       1080p@30fps (1920x1088)    200        2.9
[19]      H.264     Real-chip       √   ×    100*16           0.13um   22.9K              416           QFHD@30fps (3840x2160)     98         10.9
[20]      H.264     Gate-level      √   ×    243*16           0.18um   21.1K              864+8N        1080p@30fps (1920x1088)    238        3.0
[21]      H.264     Gate-level      √   ×    192*16           0.18um   26.0K              512+8N        QFHD@30fps (3840x2160)     187        9.6
[22]      H.264     Gate-level      √   ×    112*16           0.18um   12.1K              256           QFHD@30fps (3840x2160)     109        20.6
[23]      H.264     Gate-level      √   ×    136*16           90nm     17.9K              1056          QFHD@30fps (3840x2160)     133        14.0
[24]      HEVC      Gate-level      √   ×    2043             0.13um   17.6K              384           QFHD@60fps (3840x2160)     250        28.3
[25]      HEVC      FPGA            √   ×    2560             40nm     36.8K              832           1080p@86fps (1920x1088)    108        4.8
Proposed  HEVC      Gate-level      √   √    558              65nm     DF: (30.3+0.148)K  4200          UHD@40fps (7680x4320)      182        DF: 43.6
                                                                       SAO: 73K                                                              SAO: 18.2

For the H.264 designs, 1 LCU = 16 MB (macroblocks).
*N represents the frame width in pixels
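The frequency figure in the "Proposed" row of Table VI can be cross-checked with back-of-the-envelope arithmetic. The sketch below recomputes the working frequency required for 8K UHD at 40 fps from the 558 cycles/LCU given in the table; the 64x64 LCU size and the rounding up of partial LCUs at picture borders are our assumptions.

```python
import math

# Back-of-the-envelope cross-check of the "Proposed" row of Table VI.
# Assumptions (ours): 64x64 LCUs, partial border LCUs rounded up, and
# 558 cycles per LCU (518 for DF + SAO statistics collection plus 40
# for SAO mode decision).

def required_freq_hz(width, height, fps, cycles_per_lcu, lcu_size=64):
    """Clock frequency needed to process every LCU of every frame in time."""
    lcus_per_frame = math.ceil(width / lcu_size) * math.ceil(height / lcu_size)
    return lcus_per_frame * fps * cycles_per_lcu

freq = required_freq_hz(7680, 4320, 40, 558)  # 8K UHD @ 40 fps
# freq / 1e6 is approximately 182 MHz, consistent with Table VI
```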
The average BD-rate increment over all sequences for the "LD" and "RA" settings is 0.2% and 0.1% respectively, which shows that the FTI+SBE method can also be applied to inter coding. Meanwhile, the cycles for the SAO mode decision are reduced by 87.5% in the hardware design: the SAO mode decision phase takes only 40 cycles per LCU with the proposed FTI+SBE scheme, according to Fig. 10.

V. IMPLEMENTATION RESULT AND COMPARISON

We have implemented the proposed architecture in Verilog HDL and synthesized it with a TSMC 65nm CMOS cell library under a timing constraint of 200MHz. Table IV and Table V show the hardware implementation results. Our design requires a 103.3K gate count, 4.2KB of on-chip SRAM, and 0.6KB of on-chip buffer. It is observed that the SAO statistics collection phase consumes more than 50% of the area of the whole design, owing to the large number of addition, subtraction and comparison operations required by Eq. (6).

The hardware performance comparison between our proposed design and other designs is presented in Table VI. Our design is the only one that can support UHD applications, and it also works at a lower frequency than the other designs. In order to store not only the reconstructed pixels but also the original pixels for the SAO collection phase, our design needs more on-chip SRAM than the other designs. Moreover, a quarter-LCU based hardware architecture is proposed, while the DF is based on a MB in the H.264 designs. These two reasons lead to a large consumption of on-chip SRAM. Our design also consumes a larger SRAM size than the HEVC designs [24] and [25], because the HEVC DF designs of [24] and [25] are based on a 16x16 CU. In terms of processing time, the throughput is about 3~7 times higher than that of the other designs [17]-[25]. Our proposed hardware architecture of combined DF and SAO is designed for an HEVC intra encoder; nevertheless, the difference in the DF between intra and inter coding is only the calculation of the BS value. The area of the BS calculation supporting inter coding is 148 gates larger than that of the intra one. To give a fair comparison, we add this part to the area cost of our design.

In order to perform a fair comparison, we introduce a normalized criterion called Design Efficiency (DE), defined in equation (20):

$$DE = \frac{Format \times Fps}{Gate\ Count} \qquad (20)$$

where Format is the product of the width and the height of the supported video format in Table VI, Fps is the frame rate of the supported video format, and Gate Count is the number of equivalent NAND2 gates of the design. A larger DE indicates a greater design efficiency.

The normalized results in Table VI show that our design of the DF is at least 54% more efficient than any previous work on the DF. Our proposed SAO, which is designed for an HEVC intra encoder, has a DE value of 18.2. The difference in SAO between intra and inter coding is that the SAO of inter coding needs one more operation: it should add the offsets to the output samples of the DF according to the best SAO type.
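The DE figures in Table VI can be reproduced directly from equation (20). In the sketch below, the gate counts are taken from the table's "Proposed" row (DF: (30.3+0.148)K, SAO: 73K); rounding to one decimal is our assumption.

```python
# Design Efficiency per eq. (20): DE = Format * Fps / Gate Count.
# Gate counts below come from Table VI for the proposed design:
# DF = 30.3K + 0.148K (inter BS calculation), SAO = 73K.

def design_efficiency(width, height, fps, gate_count):
    """Supported pixel throughput per frame period, normalized by gate count."""
    return width * height * fps / gate_count

de_df = design_efficiency(7680, 4320, 40, 30448)   # DF of the proposed design
de_sao = design_efficiency(7680, 4320, 40, 73000)  # SAO of the proposed design
# de_df and de_sao land near 43.6e3 and 18.2e3, the DE column of Table VI
```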


VI. CONCLUSION

This paper proposes a hardware implementation of the combined DF and SAO for an HEVC intra encoder, which features a high throughput. With the featured balanced pipeline scheme, the maximum speed can reach 200MHz, as shown in Table V. As a result, our design can support UHD at 40fps at a working frequency of merely 182MHz, and it is also capable of supporting higher resolutions or low power applications. Meanwhile, such an implementation of the DF and SAO benefits the achievable working frequency of the whole HEVC intra encoder.

REFERENCES

[1] C.-M. Fu, C.-Y. Chen, Y.-W. Huang, and S. Lei, "Sample adaptive offset for HEVC," IEEE 13th International Workshop on Multimedia Signal Processing (MMSP), pp. 1-5, Oct. 2011.
[2] Draft ITU-T recommendation and final draft international standard of joint video specification, ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC, JVT-G050, 2003.
[3] B. Bross, W.-J. Han, J.-R. Ohm, and T. Wiegand, "High Efficiency Video Coding (HEVC) text specification draft 9 (SoDIS)," JCTVC-K1003, Oct. 2012.
[4] I.-K. Kim, K. McCann, K. Sugimoto, B. Bross, and W.-J. Han, "High Efficiency Video Coding (HEVC) Test Model 7 (HM 7) Encoder Description," JCTVC-I1002, May 2012.
[5] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, "HEVC: The New Gold Standard for Video Compression: How Does HEVC Compare with H.264/AVC?," IEEE Consumer Electronics Magazine, vol. 1, no. 3, pp. 36-46, July 2012.
[6] M. Yuen and H. R. Wu, "A survey of hybrid MC/DPCM/DCT video coding distortions," Signal Processing, vol. 70, no. 3, pp. 247-278, Nov. 1998.
[7] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC Deblocking Filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746-1754, Dec. 2012.
[8] C.-M. Fu, E. Alshina, A. Alshin, Y.-W. Huang, C.-Y. Chen, C.-Y. Tsai, C.-W. Hsu, S.-M. Lei, J.-H. Park, and W.-J. Han, "Sample Adaptive Offset in the HEVC Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755-1764, Dec. 2012.
[9] F. Bossen, B. Bross, K. Suhring, and D. Flynn, "HEVC Complexity and Implementation Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685-1696, Dec. 2012.
[10] J. Zhu, D. Zhou, G. He, and S. Goto, "A combined SAO and de-blocking filter architecture for HEVC video decoder," 20th IEEE International Conference on Image Processing (ICIP), pp. 1967-1971, Sept. 2013.
[11] Y.-W. Huang, E. Alshina, I. S. Chong, W. Wan, and M. Zhou, "Description of Core Experiment 1 (CE1): Sample adaptive offset filtering," JCTVC-H1101, Feb. 2012.
[12] R. Song, H. Cui, Y. Li, and X. Song, "A five-stage pipeline design of binary arithmetic encoder in H.264/AVC," Asia-Pacific Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1-4, Dec. 2012.
[13] V. Rosa, L. Max, and S. Bampi, "High performance architectures for the arithmetic encoder of the H.264/AVC CABAC entropy coder," 17th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), pp. 383-386, Dec. 2010.
[14] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory, 1st ed. Prentice Hall, 1993.
[15] G. Bjontegaard, "Calculation of average PSNR differences between RD curves," ITU-T SC16/Q6, VCEG-M33, Austin, Texas, pp. 1-2, 2001.
[16] "HM Reference Software," svn://hevc.kw.bbc.co.uk/svn/jctvc-hm/tags/HM-9.0, 2013.
[17] C.-A. Chien, H.-C. Chang, and J.-I. Guo, "A high throughput in-loop de-blocking filter supporting H.264/AVC BP/MP/HP video coding," IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 312-315, Nov.-Dec. 2008.
[18] K. Xu and C.-S. Choy, "A Five-Stage Pipeline, 204 Cycles/MB, Single-Port SRAM-Based Deblocking Filter for H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 3, pp. 363-374, March 2008.
[19] Y.-C. Lin and Y.-L. Lin, "A Two-Result-per-Cycle Deblocking Filter Architecture for QFHD H.264/AVC Decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 6, pp. 838-843, June 2009.
[20] T.-M. Liu, W.-P. Lee, and C.-Y. Lee, "An In/Post-Loop Deblocking Filter With Hybrid Filtering Schedule," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 7, pp. 937-943, July 2007.
[21] N. Ta, J. Youn, H. Kim, J. Choi, and S.-S. Han, "Low-power High-throughput Deblocking Filter Architecture for H.264/AVC," International Conference on Electronic Computer Technology, pp. 627-631, 2009.
[22] M. Nadeem, S. Wong, G. Kuzmanov, and A. Shabbir, "A high-throughput, area-efficient hardware accelerator for adaptive deblocking filter in H.264/AVC," IEEE/ACM/IFIP 7th Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), pp. 18-27, Oct. 2009.
[23] J. Zhou, D. Zhou, H. Zhang, Y. Hong, P. Liu, and S. Goto, "A 136 cycles/MB, luma-chroma parallelized H.264/AVC deblocking filter for QFHD applications," IEEE International Conference on Multimedia and Expo (ICME), pp. 1134-1137, 2009.
[24] X. Ye, D. Ding, and L. Yu, "A cost-efficient hardware architecture of deblocking filter in HEVC," IEEE Visual Communications and Image Processing Conference (VCIP), pp. 209-212, Dec. 2014.
[25] E. Ozcan, Y. Adibelli, and I. Hamzaoglu, "A high performance deblocking filter hardware for high efficiency video coding," IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 714-720, Aug. 2013.

Weiwei Shen received the B.S. and Ph.D. degrees in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2010 and 2015, respectively. His research interests include VLSI design, algorithms and the VLSI architectures for multimedia signal processing.

Yibo Fan received the B.E. degree in electronics and engineering from Zhejiang University, Hangzhou, China, in 2003; the M.S. degree in microelectronics from Fudan University, Shanghai, China, in 2006; and the Ph.D. degree in engineering from Waseda University, Tokyo, Japan, in 2009. He was an Assistant Professor with Shanghai Jiao Tong University, Shanghai, from 2009 to 2010, and is currently an Associate Professor with the College of Microelectronics, Fudan University. His research interests include image processing, video coding, and the associated VLSI architecture.

Yufeng Bai received the B.S. degree in electronics engineering from Sun Yat-sen University, Guangzhou, China, in 2008. He is currently working towards his M.S. degree in microelectronics at Fudan University. His research interests are video coding and its VLSI architecture design.


Leilei Huang received the B.S. degree in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2014. He is currently pursuing the M.S. degree in Microelectronics and Solid Electronics at Fudan University. His research interests include VLSI design, algorithms and corresponding VLSI architectures for multimedia signal processing.

Qing Shang received the B.S. and M.S. degrees in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2011 and 2014, respectively. His research interests include VLSI design, algorithms and VLSI architectures for multimedia signal processing and design for test.

Cong Liu received the B.S. and M.S. degrees in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2011 and 2014, respectively. His research interests include VLSI design, algorithms and the VLSI architectures for multimedia signal processing and design for test.

Xiaoyang Zeng (M'05) received the B.S. degree from Xiangtan University, Xiangtan, China, in 1992, and the Ph.D. degree from the Changchun Institute of Optics, Fine Mechanics, and Physics, Chinese Academy of Sciences, Changchun, China, in 2001. From 2001 to 2003, he was a Postdoctoral Researcher with Fudan University, Shanghai, China. He then joined the State Key Lab of ASIC and System, Fudan University, as an Associate Professor, where he is currently a Full Professor and the Director. His research interests include information security chip design, system-on-chip platforms, and VLSI implementation of digital signal processing and communication systems.
