Shen 2016
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMM.2016.2532606, IEEE Transactions on Multimedia
I. INTRODUCTION
1520-9210 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
A. Motivation
Among the algorithms adopted by HEVC, the two in-loop filters (DF and SAO) require significant CPU time. According to [9], the DF and SAO together account for about one-fifth of the CPU time of the whole HEVC codec, which makes them two of the more computationally demanding modules in HEVC.
Moreover, beyond-HD applications are becoming increasingly popular, at the expense of heavy external memory traffic. For example, the DF and SAO require 0.93 Gbps and 0.58 Gbps of I/O bandwidth, respectively, for a 2K x 1K@30fps (2560x1600) application without any data-reuse scheme. Thus, a high-throughput hardware architecture for DF and SAO becomes a critical issue in HEVC codec design.

B. Design Approach
With the increasing demand for high-resolution applications, the heavy data traffic between on-chip memory and external memory becomes even more critical. Considering the tradeoff between on-chip memory area and external memory traffic, we present an interlaced pipeline that combines the DF with SAO on a quarter-LCU basis, where a quarter-LCU is defined as a block of 32x32 pixels. For the DF, a novel filter order is proposed so that filtering on the quarter-LCU structure produces the same result as filtering on a picture basis. We also propose an interlacing memory scheme that arranges the data in on-chip memory so that it can be accessed efficiently during both vertical and horizontal filtering in the DF phase.
For statistics collection in SAO, the overall number of comparators is reduced by 83% with our proposed configurable comparator array. We also present a fragmentation adder scheme to balance the computational burden between the pipeline stages of SAO. Meanwhile, a simplified bitrate estimation method for the rate-distortion cost calculation is adopted to reduce the computational complexity of the SAO mode decision.
With these design approaches, an LCU can be filtered in 558 cycles, and our design can support UHD (7680×4320) at 40 fps at a working frequency of merely 182 MHz.

C. Organization of the Paper
The rest of this paper is organized as follows. Section II introduces the DF and SAO algorithms in HEVC. In Section III, our proposed hardware architecture is described. Section IV formulates the simplified bitrate estimation method of

[Figure: on each of the four lines i across the boundary, samples p3,i p2,i p1,i p0,i belong to the P block and samples q0,i q1,i q2,i q3,i belong to the Q block]
Fig. 3. Two adjacent 4x4 blocks

TABLE I
DEFINITION OF BS VALUES
Conditions | BS value
At least one of the P or Q blocks is intra | 2
At least one of the P or Q blocks has a non-zero coded residual coefficient and the boundary is a TU boundary | 1
Absolute differences between MVs of the P and Q blocks are >= 1 in units of integer pixels | 1
Motion-compensated prediction for the P and Q blocks refers to different pictures, or the number of MVs is different for the two blocks | 1
Otherwise | 0
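The decision order of Table I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BlockInfo` container and its field names are hypothetical, motion vectors are assumed to be stored in quarter-pel units (so one integer pixel equals 4 units), and motion vectors are compared pairwise in the order given, which simplifies the pairing used by a real codec.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BlockInfo:
    """Side information of one 4x4 block (hypothetical container, not from the paper)."""
    is_intra: bool
    has_nonzero_coeff: bool
    ref_pics: Tuple[int, ...]          # ids of the reference pictures used
    mvs: Tuple[Tuple[int, int], ...]   # (x, y) motion vectors, assumed quarter-pel units


def boundary_strength(p: BlockInfo, q: BlockInfo, is_tu_boundary: bool) -> int:
    """Return the BS value of the edge between blocks P and Q, following Table I."""
    # Row 1: at least one block is intra coded.
    if p.is_intra or q.is_intra:
        return 2
    # Row 2: non-zero residual coefficient on a TU boundary.
    if is_tu_boundary and (p.has_nonzero_coeff or q.has_nonzero_coeff):
        return 1
    # Row 4: different reference pictures or a different number of MVs.
    if sorted(p.ref_pics) != sorted(q.ref_pics) or len(p.mvs) != len(q.mvs):
        return 1
    # Row 3: MV difference of at least one integer pixel (4 quarter-pel units, assumed).
    for (px, py), (qx, qy) in zip(p.mvs, q.mvs):
        if abs(px - qx) >= 4 or abs(py - qy) >= 4:
            return 1
    return 0
```

Rows 3 and 4 both yield BS = 1, so the order in which they are tested does not change the result; the reference-picture test comes first here only so that the pairwise MV comparison runs on lists of equal length.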
[Fig. 4 panels: EO_0 (horizontal), EO_1 (vertical), EO_2 (135° diagonal), EO_3 (45° diagonal); in each pattern, c is the current sample and a and b are its two neighbors]
Fig. 4. Four 1-D directional patterns for EO classification
[Fig. 5 labels: minimum pixel value, maximum pixel value, starting band position; the offsets of four bands are signaled from the starting band]
Fig. 5. Four bands are grouped together and signaled from the starting band
TABLE II
SAMPLE CATEGORIES IN EDGE OFFSET
Category | Condition
[Figure labels: merge up LCU, merge left LCU, current LCU; components Y, Cb]
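The body of Table II did not survive in this copy. As a hedged illustration, the sketch below classifies a sample c against its two neighbors a and b using the edge-offset category conditions of the HEVC SAO design described in [8]; it should be checked against Table II of the original paper. The function names, the `EO_NEIGHBORS` table and the (row, column) offset convention are assumptions made for this example.

```python
from typing import List

# (dy, dx) offsets of neighbors a and b for each 1-D pattern of Fig. 4.
EO_NEIGHBORS = {
    "EO_0": ((0, -1), (0, 1)),    # horizontal
    "EO_1": ((-1, 0), (1, 0)),    # vertical
    "EO_2": ((-1, -1), (1, 1)),   # 135° diagonal
    "EO_3": ((-1, 1), (1, -1)),   # 45° diagonal
}


def eo_category(a: int, c: int, b: int) -> int:
    """Edge-offset category of sample c with neighbors a and b (0 means no offset)."""
    if c < a and c < b:
        return 1                  # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                  # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3                  # convex corner
    if c > a and c > b:
        return 4                  # local maximum
    return 0                      # monotonic or flat


def classify(img: List[List[int]], y: int, x: int, eo_class: str) -> int:
    """Classify img[y][x] under one EO pattern; the caller keeps both neighbors in range."""
    (ay, ax), (by, bx) = EO_NEIGHBORS[eo_class]
    return eo_category(img[y + ay][x + ax], img[y][x], img[y + by][x + bx])
```

For example, with `img = [[9, 1, 9], [4, 3, 8], [9, 7, 0]]`, the center sample is a local minimum along the horizontal pattern, so `classify(img, 1, 1, "EO_0")` returns 1.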
[Block diagram labels: data controller; CU information; BS, β, tc calculation; deblocking filter data fetch; four-line filter core; deblocking filter data store; SAO data fetch; difference; fragmentation adders; configurable comparator array; statistics SRAM_0 to SRAM_3; fast distortion estimation; SAO side information to be coded]
defined in equation (6). Next, the delta rate-distortion cost is defined by the following equation:

\[ \Delta J = \Delta D + \lambda R \tag{7} \]

In (7), λ is the Lagrange multiplier, and R represents the estimated bits of side information for a specified SAO type. The calculation of R is performed by entropy coding, which is a time-consuming process in HM 9.0. A simplified bitrate estimation method is proposed to reduce the computational complexity, which is elaborated in Section IV.

III. PROPOSED HARDWARE ARCHITECTURE
Fig. 7 shows the overall proposed hardware architecture of the combined DF and SAO. We adopt an interlaced pipeline architecture to speed up the combined DF and SAO. The whole process is partitioned into three phases: DF, SAO statistics collection, and SAO mode decision.
The whole design is based on a quarter-LCU scheme (32x32 pixels for Y, 16x16 pixels for Cb and Cr), which reduces the area of the on-chip SRAM (which stores the reconstructed and original pixels) by about 75% compared to an LCU basis, without throughput loss. Because the SAO is performed on an LCU basis in HM 9.0, a z-scan order (shown in Fig. 8) is proposed to keep the same results at the picture level (from quarter-LCU_0 to quarter-LCU_3 in an LCU).
In the DF phase, a four-stage pipeline is adopted to enhance the throughput on the basis of a four-line unit, while a two-stage pipeline is adopted in the SAO statistics collection phase. During the DF process, the statistics for deriving the offsets and for the fast distortion estimation of SAO are collected after horizontal edge filtering. Consequently, an interlaced pipeline scheme for DF and SAO is proposed to avoid fetching any post-DF pixels from SRAM and to improve the throughput significantly.

A. Analysis of Data-Reuse for the Combined DF and SAO
To clarify the advantage of the combined DF and SAO hardware architecture, an external memory access factor Em is defined to evaluate the efficiency of external memory access for DF and SAO. Em is expressed as

\[
E_{m} = E_{mr} + E_{mw}
= \frac{\text{number of memory reading accesses from the bus of a block}}{\text{number of } 4\times4 \text{ blocks of a block}}
+ \frac{\text{number of memory writing accesses to the bus of a block}}{\text{number of } 4\times4 \text{ blocks of a block}}
= \frac{\text{total number of memory accesses of a block}}{\text{number of } 4\times4 \text{ blocks of a block}}
\tag{8}
\]

where Emr and Emw represent the average number of memory accesses from and to the bus per 4x4 block, respectively. Em stands for the average number of external memory accesses per 4x4 block for DF and SAO. A smaller Em means a greater reduction in external memory bandwidth.
In a general hardware design, the DF and SAO are based on the LCU level, so the block size is set to the LCU size (64x64 pixels) in equation (8). Because sixteen vertical boundaries and sixteen horizontal boundaries must be filtered by the DF, the number of 4x4 block memory read accesses is 867 (=16x17+2x8x9+17x17+2x9x9), and the number of 4x4 block memory write accesses is 867 without a data-reuse scheme. Moreover, because the SAO requires both the post-DF and the original pixels, its number of 4x4 block memory read accesses is 1048 (=2x(17x17+2x9x9+73)), while its number of 4x4 block memory write accesses is 0 (the SAO just sends the selected type and offsets to the system) without a data-reuse scheme. Em can then be calculated as:

\[ E_{m} = \frac{(867 + 1048) + (867 + 0)}{16\times16 + 8\times8 + 8\times8} \approx 7.2 \tag{9} \]

It is observed that each 4x4 block would be accessed more than seven times to complete the DF and SAO, which causes a huge overhead on the external memory. A well-known way to reduce the external memory bandwidth is to store reusable data in on-chip memory. Here, a novel filter order scheme is proposed to reuse the temporary data after vertical boundary filtering for the subsequent horizontal boundary filtering in the DF phase, in cooperation with the proposed on-chip memory organization. Moreover, the combined DF and SAO hardware architecture can reuse the post-DF pixels for the SAO statistics collection phase.
With the above methods, the block size is set to the quarter-LCU size (32x32 pixels) in equation (8). For the DF, the number of 4x4 block memory read accesses is 112 (=8x9+2x4x5), and the number of 4x4 block memory write accesses is 16 (=8+2x4). Meanwhile, the SAO reads only the original pixels from the external memory, and its number of 4x4 block memory read accesses is 112 (=8x9+2x4x5). Em for the above data-reuse methods can be expressed as:

\[ E_{m} = \frac{(112 + 112) + (16 + 0)}{8\times8 + 4\times4 + 4\times4} = 2.5 \tag{10} \]

This means that the total number of 4x4 block external memory accesses drops to about 35% of the value in (9), which greatly eases the external memory traffic. Moreover, the combined DF and SAO hardware architecture eliminates the SRAMs for storing post-DF pixels for SAO, because the SAO can use the post-DF pixels directly.

B. Bandwidth Utilization Discussion
The external memory access factor Em is a theoretical number to evaluate the efficiency of external memory access. In addition, the bandwidth utilization of external memory is an extremely important issue in a real video encoder design, especially for DRAM.
DRAM, such as SDRAM or DDR1/2/3, is widely used as external memory and favors block-based access. In other words, if the desired data is small and scattered across external memory, the prefetch mechanism of DDR is wasted, resulting in very low utilization of the external memory bandwidth. The improvements of the DRAM
bandwidth utilization should be a serious concern in a real video encoder design. Thus, great effort should be put into joining fragmented data into continuous data chains stored in external memory.
In our external memory system, we use DDR2 (the data width is 32 bits and the burst length is 4, so one access transfers one 4x4 block), and the pixels in DDR2 are ordered with each quarter-LCU as one long chain, in which the pixels are ordered as 4x4 blocks. With our proposed data-reuse scheme, we only access the whole current quarter-LCU and a row of 4x4 blocks from the neighboring top quarter-LCU. The data in the on-chip SRAMs are also stored with 4x4 blocks as the basic unit. Therefore, we always access a consecutive integral multiple of 4x4 blocks, and there is no overhead in external memory access.

C. A Novel Filter Order Scheme
In this section, three filter order levels, named picture level, boundary level, and four-line unit level, are defined to describe our proposed filter order.
1) Picture Level Filter Order
The vertical edges in a reconstructed picture are filtered first, and then the horizontal edges are filtered with the samples modified by the DF of the vertical edges as input. The vertical edges are filtered starting with the edge on the left-hand side and proceeding through the edges towards the right-hand side in their geometrical order. The horizontal edges are filtered starting with the edge at the top and proceeding through the edges towards the bottom in their geometrical order. This is the basic filter order principle of DF in HEVC [3].
2) Boundary Level Filter Order
The boundaries involved in filtering on a quarter-LCU basis are labeled as red lines in Fig. 9. For component Y, the vertical boundaries {v1, v2, v3, v4} are filtered from left to right (if any), and the horizontal boundaries {h1, h2, h3, h4} are filtered from top to bottom (if any). The filter order of components Cb and Cr is similar to that of component Y.
3) Four-line Unit Level Filter Order
A four-line unit is composed of two adjacent 4x4 blocks, as shown in Fig. 3. Each boundary consists of several four-line units. The vertical boundaries are filtered from top four-line

[Figure labels: vertical edges to be filtered (v7, v8; blocks C94, C95)]
[Figure labels: pipeline schedule with DF stages DF_0 to DF_3 and SAO stages SAO_0 and SAO_1; DF and SAO statistics collection for quarter_LCU_1, quarter_LCU_2, quarter_LCU_3 and for Cb&Cr, followed by SAO mode decision]
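The data-reuse gain that the quarter-LCU filter order makes possible can be checked numerically against equations (8) to (10) of Section III-A. The sketch below simply re-evaluates the access counts quoted in the text; the helper names `blocks_4x4` and `em` are introduced here for illustration, and 4:2:0 sampling (two chroma components at half the luma width and height) is assumed, consistent with the denominators of (9) and (10).

```python
def blocks_4x4(luma_size: int) -> int:
    """Number of 4x4 blocks in a square block, assuming 4:2:0 sampling."""
    n = luma_size // 4                 # 4x4 blocks per luma row
    return n * n + 2 * (n // 2) ** 2   # Y plus Cb and Cr


def em(reads: int, writes: int, n_blocks: int) -> float:
    """External memory access factor of equation (8)."""
    return (reads + writes) / n_blocks


# LCU basis (64x64), no data reuse: counts quoted for equation (9).
df_reads = 16 * 17 + 2 * 8 * 9 + 17 * 17 + 2 * 9 * 9    # 867
df_writes = 867
sao_reads = 2 * (17 * 17 + 2 * 9 * 9 + 73)              # 1048
em_lcu = em(df_reads + sao_reads, df_writes + 0, blocks_4x4(64))

# Quarter-LCU basis (32x32) with data reuse: counts quoted for equation (10).
q_df_reads = 8 * 9 + 2 * 4 * 5                          # 112
q_df_writes = 8 + 2 * 4                                 # 16
q_sao_reads = 8 * 9 + 2 * 4 * 5                         # 112
em_quarter = em(q_df_reads + q_sao_reads, q_df_writes + 0, blocks_4x4(32))

ratio = em_quarter / em_lcu
print(f"Em (LCU, no reuse)      = {em_lcu:.2f}")
print(f"Em (quarter-LCU, reuse) = {em_quarter:.2f}")
print(f"remaining accesses      = {ratio:.1%}")
```

The ratio of the two factors reproduces the reduction to about 35% stated in the text.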
[Fig. 12 legend: pixels involved in SAO statistics collection; deblocking filtered pixels; horizontal four-line unit edge involved in deblocking filter]
Fig. 12. Proposed pixels' selection of combined deblocking filter and SAO
boundary obeys the four-line unit level filter order. Note that the four-line units {N1, N2, N3, N4, N5, N6, N7, N8} are not filtered in the current quarter-LCU, which is achieved by explicitly setting their BS value to 0, except when the current quarter-LCU is on the right boundary of the entire picture. These lines are instead filtered in the quarter-LCU to the right of the current one, where they correspond to {P1, P2, P3, P4, P5, P6, P7, P8}, in order to keep the same result as HM 9.0.
Because the SAO is performed on an LCU basis in HM 9.0, a z-scan order (shown in Fig. 8) is proposed to keep the same results at the picture level (from quarter-LCU_0 to quarter-LCU_3 in an LCU). Once the DF and SAO statistics collection phases are completed for quarter-LCU_0, quarter-LCU_1, quarter-LCU_2 and quarter-LCU_3 (shown in Fig. 10), the SAO mode decision phase for the LCU is invoked.

D. Proposed Memory Organization
An on-chip memory is used to reduce the I/O bandwidth between the chip and the system. The main challenge is to arrange the data in the memory modules so that it can be delivered smoothly to the in-loop filter core for both vertical and horizontal boundaries. We present an interlacing memory organization to access the data effectively during both vertical and horizontal filtering for the DF.
Our approach is to divide the on-chip memory of reconstructed pixels (used for the DF) into five modules (SRAM_C1, SRAM_C2, SRAM_L1, SRAM_L2, and SRAM_T), shown in Fig. 9. SRAM_C1 and SRAM_C2 store the pixels of the current quarter-LCU; those in SRAM_L1 and SRAM_L2 are from the left neighboring quarter-LCU. Meanwhile, the pixels in SRAM_T come from the top neighboring quarter-LCU and the left-top neighboring quarter-LCU. Each square in Fig. 9 stands for a 4x4 pixel block of the luminance or chrominance component. For example, the grey squares from C0 to C95 belong to SRAM_C1, while the white squares from C1 to C94 belong to SRAM_C2. With our proposed interlacing memory organization, the two 4x4 pixel blocks on both sides of every boundary always come from different memory modules. As a result, all pixels can be easily accessed from the SRAMs during the vertical and horizontal filtering operations.
A two-port SRAM can support one read and one write in the same clock cycle. In our proposed design, we use five two-port SRAMs for reconstructed pixels. These SRAMs can also be used to store the intermediate pixels for horizontal edge filtering after vertical edge filtering. Therefore, no more on-chip memories are needed for the DF.
Due to the directional patterns of EO in the SAO statistics collection phase, additional V and H line buffers are needed to store some post-DF pixels for the category selection (shown in Fig. 9, where each yellow and green circle stands for a pixel). Another two SRAMs (SRAM_O1, SRAM_O2) and V, H line buffers are also adopted to store the original pixels for SAO statistics collection. The organization of these memories is similar to the

Fig. 13. Proposed configurable comparator array
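One mapping that satisfies the property stated above (the two 4x4 blocks on either side of every boundary sit in different memory modules) is a checkerboard assignment. The sketch below is an assumption-laden illustration, not the layout of Fig. 9: it assumes the 96 blocks are numbered in raster order within each component (C0 to C63 for Y, C64 to C79 for Cb, C80 to C95 for Cr), and it models only SRAM_C1 and SRAM_C2, not the neighbor modules. Under that assumed numbering it agrees with the text in placing C0 and C95 in SRAM_C1 and C1 and C94 in SRAM_C2.

```python
from typing import Tuple

# Component name -> (index of first block, 4x4 blocks per row); numbering is assumed.
COMPONENTS = {"Y": (0, 8), "Cb": (64, 4), "Cr": (80, 4)}


def locate(index: int) -> Tuple[str, int, int]:
    """Map a block index C<index> to (component, row, column) under raster numbering."""
    for name, (base, width) in COMPONENTS.items():
        if base <= index < base + width * width:
            row, col = divmod(index - base, width)
            return name, row, col
    raise ValueError(f"block index {index} is outside the quarter-LCU")


def bank_of(index: int) -> str:
    """Checkerboard assignment: horizontally or vertically adjacent blocks alternate banks."""
    _, row, col = locate(index)
    return "SRAM_C1" if (row + col) % 2 == 0 else "SRAM_C2"


# Every vertical or horizontal boundary separates blocks held in different banks.
for base, width in COMPONENTS.values():
    for r in range(width):
        for c in range(width):
            i = base + r * width + c
            if c + 1 < width:
                assert bank_of(i) != bank_of(i + 1)
            if r + 1 < width:
                assert bank_of(i) != bank_of(i + width)
```

Because the two blocks of a four-line unit never share a bank, both can be read in the same cycle whether the unit lies on a vertical or a horizontal boundary, which is the access pattern the interlacing organization is meant to support.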
TABLE VI
COMPARISON WITH PREVIOUS DESIGNS
Design | Standard | Implementation style | DF | SAO | Processing time (cycle counts/LCU, 1 LCU = 16 MB) | Process | Gate count (NAND2) | On-chip SRAM (byte) | Supporting video format | Frequency (MHz) | DE (x10^3)
[17] | H.264 | Gate-level | √ | × | 260*16 | 0.13um | 36.9K | 672 | 16VGA@30fps (2560×1920) | 225 | 4.0
[18] | H.264 | Real-chip | √ | × | 204*16 | 0.18um | 21.4K | 512+64N | 1080p@30fps (1920×1088) | 200 | 2.9
[19] | H.264 | Real-chip | √ | × | 100*16 | 0.13um | 22.9K | 416 | QFHD@30fps (3840x2160) | 98 | 10.9
[20] | H.264 | Gate-level | √ | × | 243*16 | 0.18um | 21.1K | 864+8N | 1080p@30fps (1920×1088) | 238 | 3.0
[21] | H.264 | Gate-level | √ | × | 192*16 | 0.18um | 26.0K | 512+8N | QFHD@30fps (3840x2160) | 187 | 9.6
[22] | H.264 | Gate-level | √ | × | 112*16 | 0.18um | 12.1K | 256 | QFHD@30fps (3840x2160) | 109 | 20.6
[23] | H.264 | Gate-level | √ | × | 136*16 | 90nm | 17.9K | 1056 | QFHD@30fps (3840x2160) | 133 | 14.0
[24] | HEVC | Gate-level | √ | × | 2043 | 0.13um | 17.6K | 384 | QFHD@60fps (3840x2160) | 250 | 28.3
[25] | HEVC | FPGA | √ | × | 2560 | 40nm | 36.8K | 832 | 1080p@86fps (1920×1088) | 108 | 4.8
Proposed | HEVC | Gate-level | √ | √ | 558 | 65nm | DF: (30.3+0.148)K; SAO: 73K | 4200 | UHD@40fps (7680x4320) | 182 | DF: 43.6; SAO: 18.2
*N represents the frame width in pixels
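The processing-time and frequency columns of Table VI are linked: the minimum clock frequency is the number of LCUs per second multiplied by the cycles spent per LCU. The sketch below works this out for the proposed design and for [24]. The function name is introduced here for illustration, and rounding partial LCUs up at the picture edge is an assumption of this example rather than something stated in the paper.

```python
import math


def required_mhz(width: int, height: int, fps: int,
                 cycles_per_lcu: int, lcu_size: int = 64) -> float:
    """Minimum clock frequency in MHz to filter every LCU of every frame in real time."""
    # Partial LCUs at the right and bottom picture edges are rounded up (assumption).
    lcus_per_frame = math.ceil(width / lcu_size) * math.ceil(height / lcu_size)
    return lcus_per_frame * fps * cycles_per_lcu / 1e6


proposed = required_mhz(7680, 4320, 40, 558)   # UHD@40fps at 558 cycles/LCU
design_24 = required_mhz(3840, 2160, 60, 2043) # QFHD@60fps at 2043 cycles/LCU
print(f"proposed: {proposed:.1f} MHz, [24]: {design_24:.1f} MHz")
```

Under this assumption both results round to the frequencies listed in the table (182 MHz and 250 MHz).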
The average BD-rate increments for all sequences under the "LD" and "RA" settings are 0.2% and 0.1%, respectively, which shows that the FTI+SBE method can also be applied to inter coding. Meanwhile, the cycles for SAO mode decision are reduced by 87.5% in the hardware design. The SAO mode decision phase takes only 40 cycles for an LCU with the proposed FTI+SBE scheme, according to Fig. 10.

V. IMPLEMENTATION RESULT AND COMPARISON
We have implemented the proposed architecture in Verilog HDL and synthesized it with a TSMC 65nm CMOS cell library under a timing constraint of 200 MHz. Table IV and Table V show the hardware implementation results. Our design requires a gate count of 103.3K, 4.2KB of on-chip SRAM, and 0.6KB of on-chip buffer. It is observed that the SAO collection phase consumes more than 50% of the area of the whole design, owing to the heavy use of addition, subtraction and comparison operations according to Eq. (6).
The hardware performance comparison between our proposed design and other designs is presented in Table VI. Our design is the only one that can support UHD applications, and it does so at a working frequency of only 182 MHz. Because it stores not only the reconstructed pixels but also the original pixels for the SAO collection phase, our design needs more on-chip SRAM than the other designs. Moreover, our hardware architecture is based on a quarter-LCU, while the DF in H.264 designs is based on an MB. These two reasons lead to a large consumption of on-chip SRAM. Our design also consumes more SRAM than the HEVC designs [24] and [25], because the HEVC DF designs of [24] and [25] are based on a 16x16 CU. In terms of processing time, the throughput is about 3 to 7 times higher than that of the other designs [17]-[25]. Our proposed hardware architecture of combined DF and SAO is designed for an HEVC intra encoder; nevertheless, the only difference in the DF between intra and inter coding is the calculation of the BS value. The BS calculation supporting inter coding is 148 gates larger than the intra-only one. To give a fair comparison, we add this part to the area cost of our design.
In order to perform a fair comparison, we introduce a normalized criterion called Design Efficiency (DE). It is defined in equation (20).

\[ DE = \frac{\text{Format} \times \text{Fps}}{\text{Gate Count}} \tag{20} \]

where Format is the product of the width and height of the supported video format in Table VI, Fps is the frame rate of the supported video format, and Gate Count is the number of equivalent NAND2 gates of the design. A larger DE indicates a more efficient design.
The normalized results in Table VI show that our DF design is at least 54% more efficient than any previous DF design. Our proposed SAO, which is designed for an HEVC intra encoder, has a DE value of 18.2. The difference in SAO between intra and inter coding is that the SAO for inter coding needs one more operation: it should add the offset to the output samples of the DF according to the best SAO type.
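Equation (20) can be re-evaluated directly from the Table VI entries. The short sketch below does this for the proposed DF and SAO and for the strongest previous DF design [24]; the function name is illustrative, and the DF gate count of 30,448 is the (30.3+0.148)K entry of the table.

```python
def design_efficiency(width: int, height: int, fps: int, gate_count: float) -> float:
    """Design Efficiency of equation (20): pixels processed per second per NAND2 gate."""
    return width * height * fps / gate_count


de_df = design_efficiency(7680, 4320, 40, 30_448)   # proposed DF, (30.3+0.148)K gates
de_sao = design_efficiency(7680, 4320, 40, 73_000)  # proposed SAO, 73K gates
de_24 = design_efficiency(3840, 2160, 60, 17_600)   # design [24], 17.6K gates

gain = de_df / de_24 - 1.0
print(f"DE(DF) = {de_df / 1e3:.1f}e3, DE(SAO) = {de_sao / 1e3:.1f}e3, "
      f"DE([24]) = {de_24 / 1e3:.1f}e3, DF gain over [24] = {gain:.0%}")
```

The values reproduce the DE column of Table VI (in units of 10^3), and the ratio between the proposed DF and [24] gives the 54% figure quoted in the text.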
VI. CONCLUSION
This paper proposes a hardware implementation of the combined DF and SAO for an HEVC intra encoder, which features high throughput. With the balanced pipeline scheme, the maximum speed can reach 200 MHz, as shown in Table V. As a result, our design can support UHD at 40 fps at a working frequency of merely 182 MHz, and it is also suitable for higher-resolution or low-power applications. Meanwhile, such an implementation of DF and SAO benefits the achievable working frequency of the whole HEVC intra encoder.

REFERENCES
[1] Chih-Ming Fu, Ching-Yeh Chen, Yu-Wen Huang, and Shawmin Lei, "Sample adaptive offset for HEVC," 2011 IEEE 13th International Workshop on Multimedia Signal Processing (MMSP), pp. 1-5, 17-19 Oct. 2011.
[2] Draft ITU-T recommendation and final draft international standard of joint video specification, ITU-T Rec. H.264/AVC/ISO/IEC 14496-10 AVC, JVT-G050, 2003.
[3] T. Wiegand, B. Bross, W.-J. Han, J.-R. Ohm, and T. Wiegand, "High Efficiency Video Coding (HEVC) text specification draft 9 (SoDIS)," JCTVC-K1003, October 2012.
[4] I.-K. Kim, K. McCann, K. Sugimoto, B. Bross, and W.-J. Han, "High Efficiency Video Coding (HEVC) Test Model 7 (HM 7) Encoder Description," JCTVC-I1002, May 2012.
[5] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, "HEVC: The New Gold Standard for Video Compression: How Does HEVC Compare with H.264/AVC?," IEEE Consumer Electronics Magazine, vol. 1, no. 3, pp. 36-46, July 2012.
[6] M. Yuen and H. R. Wu, "A survey of hybrid MC/DPCM/DCT video coding distortions," Journal of Signal Processing, vol. 70, no. 3, pp. 247-278, November 1998.
[7] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, Minhua Zhou, and G. Van der Auwera, "HEVC Deblocking Filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746-1754, Dec. 2012.
[8] Chih-Ming Fu, E. Alshina, A. Alshin, Yu-Wen Huang, Ching-Yeh Chen, Chia-Yang Tsai, Chih-Wei Hsu, Shaw-Min Lei, Jeong-Hoon Park, and Woo-Jin Han, "Sample Adaptive Offset in the HEVC Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755-1764, Dec. 2012.
[9] F. Bossen, B. Bross, K. Suhring, and D. Flynn, "HEVC Complexity and Implementation Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685-1696, Dec. 2012.
[10] Jiayi Zhu, Dajiang Zhou, Gang He, and S. Goto, "A combined SAO and de-blocking filter architecture for HEVC video decoder," 2013 20th IEEE International Conference on Image Processing (ICIP), pp. 1967-1971, 15-18 Sept. 2013.
[11] Y.-W. Huang, E. Alshina, I. S. Chong, W. Wan, and M. Zhou, "Description of Core Experiment 1 (CE1): Sample adaptive offset filtering," JCTVC-H1101, February 2012.
[12] Rui Song, Hongfei Cui, Yunsong Li, and Xiao Song, "A five-stage pipeline design of binary arithmetic encoder in H.264/AVC," 2012 Asia-Pacific Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1-4, 3-6 Dec. 2012.
[13] V. Rosa, L. Max, and S. Bampi, "High performance architectures for the arithmetic encoder of the H.264/AVC CABAC entropy coder," 2010 17th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), pp. 383-386, 12-15 Dec. 2010.
[14] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory, 1st ed. Prentice Hall, 1993.
[15] G. Bjontegaard, "Calculation of average PSNR differences between RD curves," ITU-T SG16/Q6, VCEG-M33, Austin, Texas, pp. 1-2, 2001.
[16] "HM Reference Software," svn://hevc.kw.bbc.co.uk/svn/jctvc-hm/tags/HM-9.0, 2013.
[17] Cheng-An Chien, Hsiu-Cheng Chang, and Jiun-In Guo, "A high throughput in-loop de-blocking filter supporting H.264/AVC BP/MP/HP video coding," 2008 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 2008), pp. 312-315, Nov. 30-Dec. 3, 2008.
[18] Ke Xu and Chiu-Sing Choy, "A Five-Stage Pipeline, 204 Cycles/MB, Single-Port SRAM-Based Deblocking Filter for H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 3, pp. 363-374, March 2008.
[19] Yuan-Chun Lin and Youn-Long Lin, "A Two-Result-per-Cycle Deblocking Filter Architecture for QFHD H.264/AVC Decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 6, pp. 838-843, June 2009.
[20] Tsu-Ming Liu, Wen-Ping Lee, and Chen-Yi Lee, "An In/Post-Loop Deblocking Filter With Hybrid Filtering Schedule," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 7, pp. 937-943, July 2007.
[21] NamThang Ta, JinSeon Youn, HuiGon Kim, JunRim Choi, and Seung-Soo Han, "Low-power High-throughput Deblocking Filter Architecture for H.264/AVC," 2009 International Conference on Electronic Computer Technology, pp. 627-631, 20-22.
[22] M. Nadeem, S. Wong, G. Kuzmanov, and A. Shabbir, "A high-throughput, area-efficient hardware accelerator for adaptive deblocking filter in H.264/AVC," 2009 IEEE/ACM/IFIP 7th Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia 2009), pp. 18-27, 15-16 Oct. 2009.
[23] Jinjia Zhou, Dajiang Zhou, Hang Zhang, Yu Hong, Peilin Liu, and S. Goto, "A 136 cycles/MB, luma-chroma parallelized H.264/AVC deblocking filter for QFHD applications," 2009 IEEE International Conference on Multimedia and Expo (ICME 2009), pp. 1134-1137, June 28-July 3, 2009.
[24] Xin Ye, Dandan Ding, and Lu Yu, "A cost-efficient hardware architecture of deblocking filter in HEVC," 2014 IEEE Visual Communications and Image Processing Conference, pp. 209-212, 7-10 Dec. 2014.
[25] E. Ozcan, Y. Adibelli, and I. Hamzaoglu, "A high performance deblocking filter hardware for high efficiency video coding," IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 714-720, August 2013.

Weiwei Shen received the B.S. and Ph.D. degrees in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2010 and 2015, respectively. His research interests include VLSI design, algorithms and the VLSI architectures for multimedia signal processing.

Yibo Fan received the B.E. degree in electronics and engineering from Zhejiang University, Hangzhou, China, in 2003; the M.S. degree in microelectronics from Fudan University, Shanghai, China, in 2006; and the Ph.D. degree in engineering from Waseda University, Tokyo, Japan, in 2009. He was an Assistant Professor with Shanghai Jiao Tong University, Shanghai, from 2009 to 2010, and is currently an Associate Professor with the College of Microelectronics, Fudan University. His research interests include image processing, video coding, and associated VLSI architecture.

Yufeng Bai received the B.S. degree in electronics engineering from Sun Yat-sen University, Guangzhou, China, in 2008. He is currently working towards his M.S. degree in microelectronics at Fudan University. His research interests are video coding and its VLSI architecture design.
Leilei Huang received the B.S. degree in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2014. He is currently pursuing the M.S. degree in Microelectronics and Solid Electronics at Fudan University. His research interests include VLSI design, algorithms and corresponding VLSI architectures for multimedia signal processing.

Qing Shang received the B.S. and M.S. degrees in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2011 and 2014, respectively. His research interests include VLSI design, algorithms and VLSI architectures for multimedia signal processing and design for test.

Cong Liu received the B.S. and M.S. degrees in Microelectronics and Solid Electronics from Fudan University, Shanghai, China, in 2011 and 2014, respectively. His research interests include VLSI design, algorithms and the VLSI architectures for multimedia signal processing and design for test.

Xiaoyang Zeng (M'05) received the B.S. degree from Xiangtan University, Xiangtan, China, in 1992, and the Ph.D. degree from Changchun Institute of Optics, Fine Mechanics, and Physics, Chinese Academy of Sciences, Changchun, China, in 2001. From 2001 to 2003, he was a Postdoctoral Researcher with Fudan University, Shanghai, China. Then, he joined the State Key Lab of ASIC and System, Fudan University, as an Associate Professor, where he is currently a Full Professor and the Director. His research interests include information security chip design, system-on-chip platforms, and VLSI implementation of digital signal processing and communication systems.