
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 31, NO. 5, MAY 2023

A Reconfigurable Multiple Transform Selection Architecture for VVC

Zhijian Hao, Heming Sun, Member, IEEE, Guoqing Xiang, Peng Zhang, Xiaoyang Zeng, Member, IEEE, and Yibo Fan

Abstract— Video coding plays an important role in the highly information-based world, as videos contribute the largest part of network traffic. The latest video coding standard, Versatile Video Coding (VVC), introduces a new transform scheme, multiple transform selection (MTS), which brings considerable coding gains at the expense of high coding complexity. In this article, we propose a reconfigurable MTS architecture that supports all transform types in VVC with square and rectangular sizes ranging from 4 × 4 to 32 × 32. First, we explore the features of the three types of transform matrices and extract those that are beneficial to designing a unified architecture. Then, we present an improved calculation scheme for general transforms, where the transform matrix is decomposed into two simpler matrices to increase the similarity and decrease the complexity of the matrices involved in the three types of transform operations. Thanks to the improved calculation scheme, a unified shift-adder unit (SAU) is designed and highly reused by the different types. Moreover, we provide a twirling two-point splicing (T2S) scheme to improve reusability and to deal with data mismatch when conducting discrete cosine transform (DCT)-II of different sizes. As a consequence, an architecture with a constant throughput of 32 pixels/cycle is implemented and specified in Verilog HDL. The synthesis results indicate that the application specific integrated circuit (ASIC)-based and field-programmable gate array (FPGA)-based hardware architectures achieve significant advantages in both area and power consumption compared to existing methods in the literature.

Index Terms— Area-efficient, data arrangement, improved calculate scheme, transform architecture, unified shift-adder unit (SAU), Versatile Video Coding (VVC).

Manuscript received 15 June 2022; revised 12 November 2022; accepted 9 February 2023. Date of publication 22 February 2023; date of current version 26 April 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 62031009, in part by the Alibaba Innovative Research (AIR) Program, in part by the Fudan-ZTE Joint Laboratory, in part by the CCF-Alibaba Innovative Research Fund for Young Scholars, in part by the Japan Science and Technology Agency (JST) under Grant JPMJPR19M5, in part by the Japan Society for the Promotion of Science (JSPS) under Grant 21K17770, in part by the Kenjiro Takayanagi Foundation, and in part by the Foundation of Ando Laboratory. (Corresponding authors: Heming Sun; Yibo Fan.)
Zhijian Hao, Xiaoyang Zeng, and Yibo Fan are with the State Key Laboratory of ASIC and System, Fudan University, Shanghai 200433, China (e-mail: [email protected]; [email protected]; [email protected]).
Heming Sun is with the Waseda Research Institute for Science and Engineering, Tokyo 169-8555, Japan, and also with JST, PRESTO, Kawaguchi, Saitama 332-0012, Japan (e-mail: [email protected]).
Guoqing Xiang and Peng Zhang are with the Advanced Institute of Information Technology, Peking University, Hangzhou 311215, China (e-mail: [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2023.3245291.
Digital Object Identifier 10.1109/TVLSI.2023.3245291
1063-8210 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

The ever-increasing demand for video content, coupled with emerging video applications, including high resolutions, high frame rate (HFR), high dynamic range (HDR), and omnidirectional video, has led to a dramatic increase in traffic over the internet. This brings new challenges to the video compression community to further improve coding efficiency. Since October 2015, the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG) and the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) Video Coding Experts Group (VCEG) have been cooperating as the Joint Video Exploration Team (JVET) to explore state-of-the-art techniques for the next generation of video coding standard with capability beyond high efficiency video coding (HEVC), termed Versatile Video Coding (VVC), which was finalized in 2020 [1].

Similar to HEVC, VVC is also based on the hybrid coding scheme. With this scheme, many works [2], [3], [4], [5] in the literature contributed to providing residual data from which spatial and temporal redundancy is removed as much as possible during inter and intraprediction. After that, the residual data is transformed, quantized, and encoded into a bitstream. Various transform schemes have been developed in the literature, among which discrete cosine transform (DCT)-II has become the most popular solution because of its superior ability to balance coding efficiency and computational complexity. It has been theoretically shown in [6] and [7] that DCT-II can efficiently approximate the optimal signal-dependent Karhunen-Loeve Transform (KLT) under the first-order stationary Markov assumption. Moreover, it was further mathematically proved in [8] that the optimal horizontal (vertical) transform applied on the intraprediction residual of the horizontal (vertical) prediction mode is actually discrete sine transform (DST)-VII with a first-order Gauss-Markov model for image pixels.

To better adapt to the dynamic characteristics of image contents, numerous transform schemes [9], [10], [11] have been proposed in the past decades. In the early working drafts of VVC, the adaptive multiple transforms (AMT) scheme was proposed and four additional transform types, including DST-VII, DST-I, DCT-V, and DCT-VIII, were introduced. According to the test results in [10], a −3.1% Bjøntegaard-Delta-rate (BD-rate) reduction for the ALL-intra (AI) configuration, and up to −3.6% and −4.0% BD-rate reduction on 2K and 4K content, were achieved. However, the coding gain of AMT is obtained at the expense of high computational complexity, since five transform types need to be evaluated. In the latest VVC standard [12], the simplified AMT, named multiple transform selection (MTS), is adopted as the primary transform scheme, where three types remain and five combinations of transform types are supported, that is, (DCT-II, DCT-II), (DST-VII, DST-VII), (DST-VII, DCT-VIII), (DCT-VIII, DST-VII), and
(DCT-VIII, DCT-VIII). Compared to HEVC, the introduction of additional transform types leads to a more complex rate-distortion optimization (RDO) process in VVC, increasing software run time and hardware resource consumption.

TABLE I
BASIS FUNCTIONS OF DCT-II/VIII AND DST-VII

Based on the above facts, an efficient transform architecture will be of great benefit to hardware video encoders. Over the past decades, numerous architectures have been developed for DCT-II because of its indispensable contribution to compression coding. Most of the existing DCT-II hardware designs were implemented based on multiple constant multiplication (MCM), employing shift-add operations instead of multiplication operations to optimize resources [13], [14], [15], [16], [17], [18], [19], [20]. The MCM units in [13] share common sub-expressions to reduce hardware resources, but the design is not universal for all transform sizes. To improve the reusability, the work in [15] proposed an architecture reusable for DCT-II. This architecture supports a fixed throughput of 32 coefficients/cycle irrespective of transform size at the expense of high resource consumption. The works in [16] and [17] presented reusable designs based on the traditional DCT architecture; they utilized the property of the butterfly architecture that the elements of a smaller DCT matrix are a subset of the elements of a larger DCT matrix. This leads to low utilization of the hardware resources when performing small-size transforms. Zheng et al. [18] proposed an architecture that maximally reuses the hardware resources by rearranging the order of input data for different transform sizes while still exploiting the butterfly property. Masera et al. [19] proposed a recursive structure of a fast algorithm, which decomposes the N-point DCT-II by means of two N/2-point DCT-II, thus maximizing the hardware reusability while maintaining a constant throughput. Besides, the approximate transform scheme has become a focus of research on account of significant area reductions and negligible performance loss. Sun et al. [14] presented an area-efficient approximate method for the 32-point transform for HEVC and then extended this work in [20]. Such an approximate method reduces the computation cost of DCT by truncating a couple of the least significant bits (LSBs), most significant bits (MSBs), and zero columns. Compared with the original accurate DCT, area consumption can be reduced by 32% and power consumption by 60%. These works have made significant contributions to transform coding oriented to HEVC. However, they cannot be well extended to VVC because the matrices of DST-VII and DCT-VIII are totally different from DCT-II.

To better realize the hardware architecture of the VVC transform module, several works [21], [22], [23], [24], [25], [26], [27], [28] have investigated the implementation of MTS with DST-VII and DCT-VIII. Mert et al. [21] realized a high-performance architecture supporting these two transform types for 4 × 4 and 8 × 8 sizes. It requires more resources to enable larger sizes. Garrido et al. [22] proposed a unified 1-D hardware design for block sizes ranging from 4 × 4 to 32 × 32. This work has been extended to support 2-D design in [24], which uses dual-port random access memory (RAM) as transpose memory. To further adapt to the transform scheme in VVC, the work in [23] improved on [24] by proposing a deeply pipelined high-performance architecture and taking the asymmetric block size combinations into consideration. Kammoun et al. [25] presented a 1-D multiplier-less implementation of small-size (four-point and eight-point) transform modules. This has been extended to a 2-D design of all block sizes, including rectangular ones, in [26] at the cost of high logic utilization. Fan et al. [27] proposed a pipelined transform architecture that enabled all three types, where the N-dimensional reduced adder graph (RAG-n) algorithm was adopted. However, only the DCT-VIII and DST-VII employed the RAG-n algorithm, leading to a large number of multipliers in the remaining DCT-II module. Zhang et al. [28] investigated the inherent features existing in DST-VII and DCT-VIII. These features enable a more efficient calculation that uses only partial elements to derive identical results, which is of great significance in accelerating software running time.

As mentioned, most existing works in the literature optimized the designs of different types of transforms independently. Proposing a unified architecture for the three transform types is still a challenge. In this article, we aim to solve this problem with the following contributions.
1) We investigated the features existing in different transform matrices and extracted the features that are beneficial to designing a unified architecture.
2) Based on these features, we proposed an improved calculation scheme for transforms, which increases the matrix similarity and decreases the computational complexity when conducting transformations of different types.
3) With high similarity, a unified and area-efficient shift-adder unit (SAU) is designed. It can be reused by all transformations regardless of size and type, saving considerable resources and making it possible to realize a unified architecture.
4) To deal with the data mismatch when conducting DCT-II of different sizes, we provide a data rearrangement scheme named twirling two-point splicing (T2S), which guarantees the validity and enables high reusability of adders.
5) A unified transform architecture with five elaborated modules is presented, which can be reconfigured to perform the three transformations adopted in VVC. To the best of our knowledge, it is the first application specific integrated circuit (ASIC) unified implementation for VVC MTS and shows a certain advantage over other related works.


The rest of this article is organized as follows. Section II presents the improved calculation scheme for general transforms. In Section III, we describe the unified and reconfigurable architecture in detail. Section IV shows the synthesis results, which demonstrate the advantages of the proposed design over existing work in the literature. Section V concludes this article.

II. IMPROVED CALCULATION SCHEME FOR TRANSFORMS

A. Properties of Transforms in VVC

The primary transform types in MTS consist of DCT-II, DST-VII, and DCT-VIII. Their basis functions are shown in Table I. A common method to reduce the computational complexity is to decompose the 2-D transform into two 1-D transforms according to the separability of the transforms. For each W × H residual block, H 1-D horizontal transforms are performed first to obtain the intermediate data. Next, W 1-D vertical transforms are carried out to generate the final transformed coefficient blocks. The types of horizontal and vertical transforms are determined by the RDO process and need not be the same. For all types of transform, the 1-D transform can be calculated as follows:

$$Y = T_N \times X^T \tag{1}$$

where $X$ is a 1 × N input row/column vector, $T_N$ is the N × N transform matrix, and $Y$ is an N × 1 output vector holding the result.

The 2-D transform is conducted by two successive 1-D transforms according to the separability of trigonometric transforms. This process can be summarized as follows:

$$\begin{aligned} Y &= T_N \times Y_{tmp}^T \\ &= T_N \times (T_N \times X^T)^T \\ &= T_N \times X \times T_N^T. \end{aligned} \tag{2}$$

In this case, $Y_{tmp}$ is the intermediate matrix that holds the result of the first 1-D transform. Between the two transforms, a transpose memory is necessary to achieve efficient transposition of the intermediate matrix. Thus, the typical 2-D transform hardware contains three modules: a 1-D row transform module, a 1-D column transform module, and a transpose memory.

The process of the 1-D transform can be formulated as a matrix multiplication of an N × N transform matrix and an N × 1 input vector. The properties of the transform matrices determine the complexity of the transform operations. Therefore, we offer a brief analysis of the transform matrices of DCT-II, DST-VII, and DCT-VIII in the case of N = 8, depicted in Fig. 1, to investigate the features that have the potential to support a unified hardware architecture.

Fig. 1. Transform matrices for different types with N = 8. C-II, C-VIII, and S-VII denote DCT-II, DCT-VIII, and DST-VII, respectively.

In DST-VII and DCT-VIII, two typical features can be observed. They have been utilized in many works in the literature and are considered useful for a unified implementation of these two types. They are described in detail below.

1) Mutual Duality of DST-VII and DCT-VIII: By comparing the DST-VII and DCT-VIII matrices, where the row index starts from 0, it is concluded that the even-index rows in the two matrices have the same coefficients but in inverse order, and the odd-index ones have the opposite coefficients in inverse order. That is, specifically, for even-index rows

$$\underbrace{(a, b, c, d, e, f, g, h)}_{\text{DST-VII}} \Longleftrightarrow \underbrace{(h, g, f, e, d, c, b, a)}_{\text{DCT-VIII}}$$

and for odd-index rows

$$\underbrace{(a, b, c, d, e, f, g, h)}_{\text{DST-VII}} \Longleftrightarrow \underbrace{(-h, -g, -f, -e, -d, -c, -b, -a)}_{\text{DCT-VIII}}.$$

2) Self-Circulation of DST-VII and DCT-VIII: For DST-VII or DCT-VIII, the coefficients in each row/column of the transform matrix are obtained using a unified set of coefficients sharing the same absolute values. Only the position and sign of each coefficient can change among different rows/columns in the matrix.

Such features also exist in transform matrices of different sizes. It is convenient to design a reusable architecture for DST-VII and DCT-VIII by exploiting these two features. With Feature#1, the circuit of DST-VII can easily perform the function of DCT-VIII by carrying out two simple operations: 1) reverse the input order and 2) assign the opposite sign to the odd-index outputs. On the other hand, Feature#2 enables an efficient implementation for DST-VII/DCT-VIII. Specifically, we can design an SAU circuit to generate all the product terms required in the transform process, as presented in [27] and [29], where the multipliers are replaced by adders and shifts. The SAU-based transform scheme will be described in Section III-B in detail.

Before investigating the features in the DCT-II matrix, we briefly review the well-known fast DCT-II algorithm [15] named the Even-Odd decomposition algorithm, also known as
Butterfly algorithm. As depicted in Fig. 1, for the DCT-II matrix, the elements in the even-index rows are symmetric about the solid line, and the odd-index ones are anti-symmetric about the dotted line. In the Butterfly algorithm, the inputs that are multiplied by the same matrix element are added or subtracted according to the symmetry and anti-symmetry, which reduces the number of multipliers required subsequently [15]. The Butterfly algorithm can be described in two steps, as illustrated in Fig. 2 with N = 16.

Fig. 2. Process of Butterfly algorithm adopted in 16-point DCT-II transform. Step1 contains three depths corresponding to recursive Even-Odd decomposition processes. Matrices multiplications are conducted in Step2.

Step1: The input data are separated into even and odd vectors by addition and subtraction operations according to the symmetry and anti-symmetry properties. The even vectors can be separated into sub-even and sub-odd vectors recursively. This process is formulated as (3) and (4).

Even vectors

$$\begin{aligned} e[i] &= x[i] + x[N-1-i] \\ ee[i] &= e[i] + e[N/2-1-i] \\ &\cdots \\ e^{n}e[i] &= e^{n-1}e[i] + e^{n-1}e[N/n-1-i]. \end{aligned} \tag{3}$$

Odd vectors

$$\begin{aligned} o[i] &= x[i] - x[N-1-i] \\ eo[i] &= e[i] - e[N/2-1-i] \\ &\cdots \\ e^{n}o[i] &= e^{n-1}e[i] - e^{n-1}e[N/n-1-i]. \end{aligned} \tag{4}$$

Step2: After deriving the odd vectors and even vectors, the odd vectors $O \sim E^nO$ and the last even vector $E^nE$ are multiplied by the corresponding odd matrices $T_n^o$ and even matrix $T_2^e$ to obtain the transformed coefficients, as depicted in Step2 in Fig. 2.

$Y$ is the transformed coefficient vector, and $T_n^o$ and $T_2^e$ are the sub-matrices extracted from the N-point transform matrix $T_N$, corresponding to each depth in the Butterfly algorithm. To clarify, we offer $T_4^o$, $T_2^o$, and $T_2^e$ in (5), which can be observed in the DCT-II transform matrix with size N = 8

$$T_4^o = \begin{bmatrix} 89 & 75 & 50 & 18 \\ 75 & -18 & -89 & -50 \\ 50 & -89 & 18 & 75 \\ 18 & -50 & 75 & -89 \end{bmatrix}, \quad T_2^o = \begin{bmatrix} 83 & 36 \\ 36 & -83 \end{bmatrix}, \quad T_2^e = \begin{bmatrix} 64 & 64 \\ 64 & -64 \end{bmatrix}. \tag{5}$$

Feature#1 and Feature#2 do not explicitly exist in the original DCT-II matrix $T_N$, so the common features for DCT-II and DST-VII/DCT-VIII are still obscure. In fact, the Butterfly algorithm is widely used when implementing DCT-II. Therefore, the matrices involved in the DCT-II are actually the sub-matrices $T_n^o$ and $T_2^e$ instead of the original transform matrix $T_N$. What contributes to a unified architecture for the three transform types is that Feature#2 can be found in all $T_n^o$ and $T_2^e$ matrices. Besides, the $T_n^o$ and $T_2^e$ matrices stay the same regardless of the size N. For example, $T_4^o$ is always the same as shown in (5), no matter what size of transform is being performed. We denote these as Feature#3 and Feature#4, related to DCT-II.

3) Self-Circulation of $T_n^o$ and $T_2^e$ in DCT-II: In the Butterfly algorithm, the coefficients of different rows in the $T_n^o$ and $T_2^e$ matrices have the same values, only in different orders and with different signs.

4) Invariance Property of $T_n^o$ and $T_2^e$ in DCT-II: The $T_n^o$ and $T_2^e$ matrices stay the same when the transform size N changes.

With these remarkable features, it is possible to establish a unified and reconfigurable architecture for these three types of transforms in VVC.

B. Improved Calculation Scheme for Transforms

In this part, we propose an improved calculation scheme for transforms to decrease the complexity of matrix multiplication. As mentioned, the primary calculation operations in all 1-D transforms are carried out by multiplying N × N matrices and N × 1 input vectors, where the N × N matrices possess Feature#2. This is exactly the case for DST-VII/DCT-VIII. Regarding DCT-II, it is reflected in the calculation of the $E^nO$ and $E^nE$ parts in the Butterfly algorithm, as Feature#3 describes. The work in [27] contributed an SAU-based 2-D transform architecture for DST-VII/DCT-VIII by exploiting Feature#2. Nonetheless, it did not go into Feature#3 and did not optimize DCT-II with the proposed transform scheme.

Taking Feature#3 into consideration, we can easily develop a similar SAU-based architecture for DCT-II, in which the output of the SAUs corresponds to the DCT-II matrices instead of DST-VII. Nonetheless, resource consumption is not reduced much, due to the poor reusability caused by the difference between the DCT-II and DST-VII transform matrices. For example, for N = 4, the SAUs in the DCT-II case are expected to offer the product terms {18x, 50x, 75x, 89x} according to $T_4^o$ shown in (5). In the DST-VII case, the SAUs are designed to output {29x, 55x, 74x, 84x} according to the DST-VII transform matrix $T_4$, as shown in the following equation:

$$T_4 = \begin{bmatrix} 29 & 55 & 74 & 84 \\ 74 & 74 & 0 & -74 \\ 84 & -29 & -74 & 55 \\ 55 & -84 & 74 & -29 \end{bmatrix}. \tag{6}$$

The circuits of these two types of SAUs are composed of different combinations of adders and shifts. Therefore, they are not convenient to reuse. To overcome this challenge, inspired by Zeng et al. [30] and Hao et al. [31], we propose an improved calculation scheme in which the matrix elements
required in all transforms are limited to a small range, and the related multiplications can be conducted with a unified SAU. This optimized scheme reduces the computational complexity to a large extent and is suitable for the three types of transform. It is described below.

TABLE II
RANGE OF ABSOLUTE VALUE IN $Q_N$/$Q_N^o$ AND $R_N$/$R_N^o$ MATRICES

Fig. 3. Unified 1-D transform architecture.
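
To make the reuse problem noted above concrete, the following Python sketch computes $Y = T \times X^T$ for the two 4 × 4 matrices quoted in the text, $T_4^o$ from (5) and the DST-VII matrix $T_4$ from (6), using one table of product terms per input sample followed by signed additions. The names `product_terms`, `sau_matvec`, and `direct_matvec` are hypothetical, and the `*` operator merely stands in for the shift-add network that a hardware SAU would implement. The only point illustrated is that the two matrices call for disjoint sets of product terms.

```python
# Matrices as given in (5) and (6) of the text.
T4O_DCT2 = [
    [89, 75, 50, 18],
    [75, -18, -89, -50],
    [50, -89, 18, 75],
    [18, -50, 75, -89],
]
T4_DST7 = [
    [29, 55, 74, 84],
    [74, 74, 0, -74],
    [84, -29, -74, 55],
    [55, -84, 74, -29],
]


def product_terms(matrix):
    """Distinct non-zero magnitudes: the multiples of x one SAU would have to supply."""
    return sorted({abs(c) for row in matrix for c in row if c != 0})


def sau_matvec(matrix, x):
    """Matrix-vector product built from per-input product tables and signed additions."""
    terms = product_terms(matrix)
    # One table per input sample; '*' is a stand-in for shifts and adds in hardware.
    sau_out = [{c: c * sample for c in terms} for sample in x]
    y = []
    for row in matrix:
        acc = 0
        for j, coeff in enumerate(row):
            if coeff > 0:
                acc += sau_out[j][coeff]
            elif coeff < 0:
                acc -= sau_out[j][-coeff]
        y.append(acc)
    return y


def direct_matvec(matrix, x):
    """Reference product with ordinary multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in matrix]


if __name__ == "__main__":
    x = [3, -1, 4, 2]
    for name, m in (("DCT-II T4o", T4O_DCT2), ("DST-VII T4", T4_DST7)):
        assert sau_matvec(m, x) == direct_matvec(m, x)
        print(name, "needs product terms", product_terms(m))
```

Because every row of each matrix uses the same magnitudes (Feature#2 and Feature#3), a single table per input suffices for a given matrix, but the DCT-II table and the DST-VII table have no term in common, which is the reuse obstacle the decomposition below is meant to remove.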
Firstly, the transform matrix $T_N$ is decomposed into two simpler matrices, named the Quotient matrix $Q_N$ and the Remainder matrix $R_N$, as follows:

$$T_N = Q_N \times \begin{bmatrix} s & 0 & \cdots & 0 \\ 0 & s & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s \end{bmatrix} + R_N \tag{7}$$

where $s$ is a scaling factor whose value is a power of two ($2^n$), set by the trade-off between the complexity of $Q_N$ and $R_N$. The computational complexity of the matrix operation is proportional to the absolute value of the elements in $Q_N$ and $R_N$. As $s$ increases, the complexity of $Q_N$ decreases and that of $R_N$ increases. For each entry in $T_N$, its decomposition is given by the following equation:

$$\begin{cases} T_N(i, j) = Q_N(i, j) \times s + R_N(i, j) \\ |R_N(i, j)| \le \dfrac{s}{2} \end{cases} \tag{8}$$

where $i, j \in [0, N-1]$ are the row and column indexes. Applying (8) to the DCT-II sub-matrix $T_4^o$ in (5) and the DST-VII matrix $T_4$ in (6), the corresponding $Q_N$ and $R_N$ matrices can be derived, as presented in the following equation:

$$\begin{aligned} Q_4 &= \begin{bmatrix} 2 & 3 & 5 & 5 \\ 5 & 5 & 0 & -5 \\ 5 & -2 & -5 & 3 \\ 3 & -5 & 5 & -2 \end{bmatrix}, & R_4 &= \begin{bmatrix} -3 & 7 & -6 & 4 \\ -6 & -6 & 0 & 6 \\ 4 & 3 & 6 & 7 \\ 7 & -4 & -6 & 3 \end{bmatrix} \\ Q_4^o &= \begin{bmatrix} 6 & 5 & 3 & 1 \\ 5 & -1 & -6 & -3 \\ 3 & -6 & 1 & 5 \\ 1 & -3 & 5 & -6 \end{bmatrix}, & R_4^o &= \begin{bmatrix} -7 & -5 & 2 & 2 \\ -5 & -2 & 7 & -2 \\ 2 & 7 & 2 & -5 \\ 2 & -2 & -5 & 7 \end{bmatrix}. \end{aligned} \tag{9}$$

For example, with $s = 16$, the entry 89 of $T_4^o$ is written as $89 = 6 \times 16 - 7$, so the corresponding entries of $Q_4^o$ and $R_4^o$ are 6 and −7.

With the decomposition algorithm adopted, the value of each element in the transform matrix is broken down into two smaller values, which are more suitable for multiplier-less transform circuit designs. Note that the maximum entries of the DST-VII and DCT-II matrices are 91. We can acquire the range of absolute values in the $Q_N$ and $R_N$ matrices presented in Table II, which is proportional to the complexity of the matrix operations, as mentioned before. Obviously, when $s$ is equal to 16, these two matrices share a similar level of complexity.

Then the optimized 1-D transform scheme can be described as follows:

$$\begin{aligned} Y &= T_N \times X^T \\ &= Q_N \times X^T \times s + R_N \times X^T \end{aligned} \tag{10}$$

where $X$ is the input residual vector, and the resource-intensive matrix operation is carried out by two simpler matrix multiplications and shift operations. Moreover, the similar complexity of $Q_N$ and $R_N$ enables a reusable design when conducting these two matrix multiplications. A resource-saving and unified SAU circuit with outputs {1x, 2x, 3x, 4x, 5x, 6x, 7x, 8x} is capable of conducting the multiplications related to all $Q_N$ and $R_N$ matrices, which will be discussed in Section III-B.

This optimized calculation scheme can be directly applied to DST-VII/DCT-VIII with the SAUs-based 1-D transform scheme according to Feature#2. For DCT-II, applying it to the calculation processes in the Butterfly algorithm can further reduce the overall computational complexity, as specified in Feature#3.

Fig. 4. Even-Odd decompose process for the combined input vector consisting of eight four-point rows data.

III. UNIFIED RECONFIGURABLE TRANSFORM ARCHITECTURE FOR VVC

In this section, we propose a unified reconfigurable architecture that can support the reusability of various transform sizes and types in VVC. The presented architecture consists of five modules: SAUs inputs selection, unified SAUs calculator, adder-trees inputs selection, adder-trees calculator, and outputs selection, as plotted in Fig. 3.

A. SAU Inputs Selection

This module is designed to prepare the input vectors sent to the SAU calculators. It is expected to deal with various sizes
and types. In order to make it reconfigurable for various transform types, there are two datapaths available in this module. One is for DST-VII/DCT-VIII, and the other can be configured to perform the Even-Odd decomposition in Step1 of the Butterfly algorithm for DCT-II. The only difference between DST-VII and DCT-VIII is whether the data is input in forward or reverse order.

Fig. 6. Outputs of the SAU inputs selection module under the configurations of different sizes and types. (a) Cases of DST-VII/DCT-VIII. (b) Cases of DCT-II.

To improve resource utilization and achieve constant throughput, we designed a unified interface with 32 residual inputs. When processing sizes less than 32, multiple rows or columns are combined as the input data. Therefore, this reconfigurable architecture can conduct one 32-point, two 16-point, four eight-point, or eight four-point transforms, controlled by the size flag.

In this module, the main calculation operation concentrates on the Even-Odd decomposition when conducting the DCT-II transform. As described in (3) and (4), multiple adders and subtractors are necessary for extracting the $E^nO$ and $E^nE$ vectors. For eight four-point DCT-II transforms, the first two and last two data are added and subtracted to obtain the corresponding even vector and odd vector in each row. This Even-Odd decompose process is illustrated in Fig. 4.

For reusing these adders and subtractors when processing different sizes, the problem of data mismatch needs to be handled carefully. For example, for two 16-point transforms, two rows are spliced into a combined 32-point input vector. Delivering the combined 32-point input vector directly to the Even-Odd decomposition circuit in Fig. 4 would produce errors, because data in the wrong positions are added and subtracted. Specifically, $X_0$ needs to be added to $X_{15}$ to obtain $e[0]$. However, it would be added to $X_3$ in Fig. 4, and a wrong $e[0]$ would be generated. Admittedly, designing different Even-Odd decomposition circuits for different sizes can be a solution to this data mismatch problem, but additional resource consumption is introduced. To overcome this shortcoming and improve resource utilization, we propose a rearrangement scheme for all the combined 32-point vectors, depicted in Fig. 5.

Fig. 5. Rearrangement with T2S scheme for the inputs of Even-Odd decomposition circuit with different sizes and depths. (a) For N = 32, depth0. (b) For N = 16, depth0; for N = 32, depth1. (c) For N = 8, depth0; for N = 16, depth1.
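
The data mismatch described above can be reproduced with a few lines of Python. The sketch below is illustrative only: `fixed_even_odd` is a hypothetical software model of a decomposition circuit wired for four-point rows (position i paired with position 3 − i inside every group of four inputs, as the text describes for Fig. 4), `reference_even_odd` evaluates the depth-0 terms of (3) and (4) directly, and `two_point_splice` reorders one row following the N = 16 ordering listed in the paragraph that follows, {X0, X1, X14, X15, X2, X3, X12, X13, ...}. Extending that ordering to other sizes by the same pattern, and the order in which the modelled circuit emits its outputs, are assumptions.

```python
def fixed_even_odd(v, group=4):
    """Model of a circuit wired for 4-point rows: pair position i with 3 - i in each group."""
    even, odd = [], []
    for base in range(0, len(v), group):
        for i in range(group // 2):
            a, b = v[base + i], v[base + group - 1 - i]
            even.append(a + b)
            odd.append(a - b)
    return even, odd


def reference_even_odd(v, n):
    """Depth-0 terms of (3) and (4) for every n-point row packed into v."""
    even, odd = [], []
    for base in range(0, len(v), n):
        row = v[base:base + n]
        even += [row[i] + row[n - 1 - i] for i in range(n // 2)]
        odd += [row[i] - row[n - 1 - i] for i in range(n // 2)]
    return even, odd


def two_point_splice(row):
    """Reorder a row as x0, x1, x[n-2], x[n-1], x2, x3, x[n-4], x[n-3], ..."""
    n = len(row)
    out = []
    for k in range(0, n // 2, 2):
        out += [row[k], row[k + 1], row[n - 2 - k], row[n - 1 - k]]
    return out


if __name__ == "__main__":
    v = [i * i for i in range(32)]  # arbitrary 32 residual inputs

    # Eight 4-point rows: the fixed wiring is already correct.
    assert fixed_even_odd(v) == reference_even_odd(v, 4)

    # Two 16-point rows fed in directly: x0 is paired with x3 instead of x15.
    assert fixed_even_odd(v)[0][0] == v[0] + v[3]
    assert fixed_even_odd(v) != reference_even_odd(v, 16)

    # After splicing each row, the same wiring yields the 16-point result.
    spliced = two_point_splice(v[:16]) + two_point_splice(v[16:])
    assert fixed_even_odd(spliced) == reference_even_odd(v, 16)
```

The reordering places each sample and its partner inside the same group of four, so the adders and subtractors wired for the four-point case can be reused unchanged.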


Taking two 16-point transforms as an example, the specific TABLE III


process of this rearrangement is described as follows. The C OMPARISON OF SAU D ESIGNS W ITH P REVIOUS W ORK
overall position of two original rows in the combined 32-point
vectors keeps unchanged. That is, X 0 to X 15 are still the
first-row data, and X 16 to X 31 are the second-row data. The
internal data in each row need to be rearranged by what we
named T2S scheme, which is concretely demonstrated in
Fig. 5. In terms of N = 16, T2S is reflected in that {X 0 , X 1 }
is spliced with {X 14 , X 15 }, and then {X 2 , X 3 }, and followed
by {X 12 , X 13 }, and so on. After such a rearrangement,
the first 16 data of rearranged 32-point vector becomes
{X 0 , X 1 , X 14 , X 15 , X 2 , X 3 , X 12 , X 13 , X 4 , X 5 , X 10 , X 11 , X 6 , X 7 ,
X 8 , X 9 }. Sending this rearranged vector to the decomposition
circuit presented in Fig. 4 ensures the correctness of
decomposition.
Assuming that $p$ is the original position of a datum in the 32-point input vector, (11) gives the rule for mapping the original position $p$ to the rearranged position $p'$:

$$
p' = Q + \begin{cases}
2\left(\lfloor R/2 \rfloor + 1\right) + R\,\%\,2, & R < \dfrac{N}{2} \\[6pt]
2\left(\dfrac{N}{2} - \lfloor R/2 \rfloor\right) + R\,\%\,2, & R \ge \dfrac{N}{2}
\end{cases}
\tag{11}
$$

where $Q = \lfloor p/N \rfloor$ is the quotient of $p$ divided by the transform size $N$ and $R = p\,\%\,N$ is the corresponding remainder; $R/2$ is taken as an integer division. For $X_{15}$ in the case of $N = 16$, whose $Q$ and $R$ are 0 and 15, the original position in the combined 32-point vector is 15, and the rearranged position $p'$ is $0 + 2 \times (8 - 7) + 1 = 3$ according to (11). Similarly, the rearranged position of $X_3$ is $0 + 2 \times (1 + 1) + 1 = 5$. This agrees exactly with Fig. 5(b). The rearrangement scheme works for all allowed combined sizes.

In the traditional butterfly architecture, thanks to the symmetry of the transform matrix, the $E_nE$ obtained from the Even-Odd decomposition can be further decomposed. Therefore, the rearrangement must also be performed for $E_nE$ at different decomposition depths. The rearrangement for Depth1 is a subset of that in Depth0, as framed by the red dotted line in Fig. 2. Only sizes 16 and 32 need such a rearrangement in Depth1. Size 8 in Depth1 becomes a regular case for the Even-Odd decomposition circuit, just like size 4 in Depth0. In our design, the rearrangement is conducted at most three times, which corresponds to the 32-point transform.

To clarify the function of this module, Fig. 6 illustrates the outputs under various configurations. For DST-VII/DCT-VIII, the outputs for different sizes are summarized in Fig. 6(a); they are composed of the original input data. In the case of DCT-II, the outputs consist of the $E_nO$ and $E_nE$ vectors generated by the Even-Odd decomposition circuits. Each output is connected with one adder-tree calculator to generate the required product terms, which will be described in detail in Section III-B.

B. SAU Calculator

This module generates the product terms required by the SAUs-based matrix multiplications. Taking $Y = Q_{o4} \times X^T$ as an instance, Fig. 7 briefly shows how the SAUs-based matrix multiplication works. Firstly, the values {1, 3, 5, 6} in $Q_{o4}$ can be derived from the values {18, 50, 75, 89} in the DCT-II transform matrix $T_{4o}$ according to (10). Therefore, each SAU in the SAUs-based matrix multiplication is expected to provide at least four outputs {1x, 3x, 5x, 6x}, and four SAUs are employed as depicted in Fig. 7(b). The intermediate outputs of these four SAUs are then added selectively by an adder-tree calculator to obtain the result.

Fig. 7. (a) Unified SAU. (b) SAUs-based matrix multiplication for N = 4.

To save resources and provide a highly reusable architecture, we propose the unified SAU shown in Fig. 7(a), which consists of three adders and four shifts. As mentioned in Table I, the absolute values in $Q_N$ and $R_N$ range from 0 to 8, so the unified SAU has one input port and eight output ports. Compared to the classical matrix multiplication, the SAUs-based implementation reduces the number of multipliers by 16 (0 in the SAUs-based case, 16 in the classic case) and increases the number of adders by 28 (40 in the SAUs-based case, 12 in the classic case), which significantly reduces the hardware resources.

To achieve a constant throughput of 32 pixels/cycle, 32 unified SAUs are required in the proposed architecture. As the outputs of each SAU cover all the absolute values in the $Q_N$ and $R_N$ matrices, the product terms of the $Q_N$ and $R_N$ matrix multiplications can be generated at the same time without doubling the number of unified SAUs, which further improves circuit reusability.

Benefiting from the optimized calculation scheme, the unified SAU-based scheme saves many logic resources compared to the state-of-the-art work. Fan et al. [27] presented multiple kinds of SAUs for different transform sizes. The comparison between the proposed unified SAU and the SAUs in [27] is given in Table III. Their SAUs of different sizes cannot be reused. Therefore, to achieve a constant throughput
of 32 pixels/cycle, the design in [27] needs 32 four-point SAUs, 32 eight-point SAUs, 32 16-point SAUs, and 32 32-point SAUs. In contrast, 32 unified SAUs are enough in our design, which is a clear advantage.

C. Adder-Tree Calculator and Input Selection

In the proposed architecture, there are 64 adder-tree calculators. Half of them sum up the product terms of the $Q_N$ matrix multiplications and the others do the same for $R_N$, which ensures the high parallelism and pipelining ability of the overall design. Each adder-tree calculator has 16 input data ports and one output port; it sums the 16 input data and outputs the result. Its diagram is plotted in Fig. 8. According to Fig. 8, the critical path is composed of four adders.

Fig. 8. Diagram of an adder-tree calculator with 16 input data ports and one output data port. The red dotted line marks the critical path, which consists of four adders.

The adder-tree input selection module picks out the product terms needed by each adder-tree calculator. After the SAUs calculator, all needed product terms are available during the matrix multiplication process. According to the requirements of the different transform types and sizes, this module selects the corresponding data from the outputs of the unified SAUs.

However, one problem needs extra attention. For the 32-point DST-VII transform, a transformed coefficient cannot be obtained directly, because 32 product terms are selected from the outputs of different unified SAUs and at least 31 adders are needed to obtain one coefficient, while one adder-tree calculator as described above can only sum up 16 product terms. To save resources and improve reusability, we solve this problem by taking into consideration the high-frequency zeroing adjustment method adopted in VVC. A brief introduction to this method is given here to clarify the efficiency of the solution.

To align the worst-case multiplications per coefficient with HEVC, for the 64-point DCT-II and the 32-point DST-VII/DCT-VIII only the first 32 and 16 low-frequency coefficients, respectively, are kept in VVC; the high-frequency coefficients are zeroed out. Fig. 9(a) depicts how the high-frequency zeroing adjustment method works in DST-VII for $N = 32$. Specifically, for each row or column in the 1-D transforms, the first 16 transformed coefficients $\{Y_0, Y_1, \ldots, Y_{15}\}$ are retained, and the last 16 ones $\{Y_{16}, Y_{17}, \ldots, Y_{31}\}$ are set to zero. Therefore, the last 16 transformed coefficients need not be calculated, and the 16th to 31st adder-tree calculators can be reused for obtaining the first 16 transformed coefficients. The reuse scheme is shown in Fig. 9(b), where $O_i$ denotes the outputs of the preceding adder-trees selection module and $R_i$ the corresponding results. Based on the analysis above, the first 16 outputs of the adder-trees selection module contain 32 product terms and the last 16 outputs contain only 16. When conducting the 32-point DST-VII, the first 16 product terms in $O_0$–$O_{15}$ are delivered to adder-tree calculators $A_0$–$A_{15}$, and the last 16 product terms in $O_0$–$O_{15}$ are delivered to $A_{16}$–$A_{31}$. After the adder-tree calculators, the results of $A_i$ and $A_{32-i}$ are added to get the transformed coefficients. Thanks to Feature#1, DCT-VIII can be handled by this reuse circuit with a simple reverse operation. For DCT-VIII, reversing the input order from $O_0$–$O_{31}$ to $O_{31}$–$O_0$ generates the corresponding low-frequency coefficients in $R_0$–$R_{15}$.

Fig. 9. (a) High-frequency zeroing adjustment method. (b) Reuse scheme for DST-VII.

D. Outputs Selection

After summing up the results of $Q_N$ and $R_N$, the derived transformed coefficients are sent to the Outputs selection module, which adjusts their orders and signs. The default datapath of the proposed architecture is designed around DCT-II and DST-VII. Therefore, the coefficients at odd indices need to be negated when conducting DCT-VIII, according to Feature#1. Besides, for DCT-II, the order of the coefficients should be adjusted according to the Butterfly algorithm. Taking the 32-point DCT-II transform as an example, the last 16 coefficients in the $O$ vector depicted in Fig. 6(b) actually belong at the odd indices from $Y_1$ to $Y_{31}$, so they must be reordered for later modules to read them correctly.

IV. EXPERIMENTAL AND SYNTHESIS RESULTS

In this article, unified 1-D and 2-D transform architectures are proposed and specified in Verilog HDL; they can be reconfigured to perform all transform types with the rectangular sizes ranging from 4 × 4 to 32 × 32 allowed in the VVC standard. Three kinds of 1-D designs are presented in this article, distinguished by type, to make a detailed comparison with the existing state-of-the-art works: DCT-II-only, DST-VII/DCT-VIII-only, and all types. A 2-D architecture enabling
all types is enough to illustrate the advantages of the proposed design, because it is extended from the 1-D architecture with the same datapath but a larger bit-width and enjoys the same area and power-consumption advantages as the 1-D architecture. To validate the design, we export the residual data derived from intra or inter prediction in the VVC reference software VTM 16.0 as test vectors, send them to the design module, and compare the output results with the corresponding coefficients in VTM 16.0. Abundant video sequences, from Class A (2560 × 1600) to Class E (1280 × 720) with QP = {22, 27, 32, 37}, are tested, and the results of all test vectors are validated.

We synthesize the proposed architecture with Design Compiler using the TSMC 65 nm cell library, and the maximum working frequency is up to 476 MHz. The transpose memory in the proposed architecture is based on the diagonal data mapping scheme, which was also adopted in [3], [27], and [32]. Tables IV–VII present the detailed comparison with the works in the comparison list. It is well known that stronger processing ability often requires a larger area and higher power consumption. For a fairer comparison, the normalized area (NA) introduced in [27] and the normalized power consumption (NPC) are adopted and defined as follows:

$$
\mathrm{NA} = \frac{\mathrm{Gate}}{\mathrm{MaxFreq} \times \mathrm{Pixel\ Parallelism}} \tag{12}
$$

$$
\mathrm{NPC} = \frac{\mathrm{Energy\ Consumption}}{\mathrm{MaxFreq} \times \mathrm{Pixel\ Parallelism}} \tag{13}
$$

where MaxFreq is the maximum working frequency of the design and is selected as the operating frequency in our experiments, Energy Consumption is the corresponding power dissipation parameter, and Gate is the equivalent number of gates estimated from the total area and the area of a two-input NAND gate. Pixel Parallelism indicates the number of pixels processed in each cycle.

A. Synthesis of 1-D Transform

Table IV gives a detailed comparison of 1-D transform schemes between the proposed architecture and the state-of-the-art works. Most of them support only a single kind of transform type, like [29], [15], [33], and [19]. Complete ASIC implementations of the three transform types are few and remain a key research issue.

From the perspective of DCT-II, Table IV shows that the proposed design provides better performance in terms of resource consumption and power dissipation. The NA is only 9.3 thanks to various methods of circuit reuse, while the NAs of [15], [33], and [19] are 43.8, 10.8, and 29.4, respectively. Compared to them, our design has a smaller NA and achieves area reductions of 78.8%, 14%, and 68.4%. One of the main concerns nowadays in ASIC projects is power dissipation/energy consumption. However, only [33] offered a power analysis of 1-D transform designs. Note that the power estimation in [33] is done with the clock constrained at 100 MHz instead of its maximum frequency of 256 MHz, so we also provide a result of synthesizing our design at 100 MHz, as presented in Table V. As can be seen from the NPC indicators, the proposed DCT-II architecture still shows a certain advantage with 2.5 versus 5.0, though a more advanced technology library contributes to the results.

When it comes to DST-VII/DCT-VIII, Zhang et al. [28] investigated the inherent features of DST-VII and DCT-VIII and proposed an algorithm of great significance for accelerating the software calculation time. However, it is not suitable for unified hardware architecture designs because the execution process varies with the size. What is more, unlike the 1-D DCT-II, whose symmetry and anti-symmetry properties can be exploited by the Butterfly algorithm to reduce the computational complexity, the 1-D DST-VII/DCT-VIII architecture generally carries a larger amount of calculation and heavier resource consumption than DCT-II. This can be observed from the Gate Count in Table IV. The main work for comparison is [29], which was expanded into a 2-D architecture [27]. Table IV indicates that the proposed 1-D DST-VII/DCT-VIII architecture reduces the NA by 33.5%. Though a power dissipation analysis is not available in [29], it can be inferred that a smaller area ensures a lower NPC indicator under the same technology library.

A comparison of theoretical resource consumption is described here and could explain the area advantage of the proposed DST-VII/DCT-VIII architecture over [29]. Since additions and shifts are used to replace multiplications, the theoretical resource comparison is carried out by estimating the number of adders. For a constant throughput of 32 pixels/cycle, our design needs 32 unified SAUs and 64 adder-tree calculators (32 for $Q_N$ operations and 32 for $R_N$ operations). The total number of adders is 1056 (32 × 3 + 15 × 64). In [29], SAUs of different sizes cannot be reused because they differ, so 32 32-point SAUs, 32 16-point SAUs, 32 eight-point SAUs, and 32 four-point SAUs are necessary for their design. Therefore, its SAUs module contains 1568 adders according to Table III (6 × 32 + 7 × 32 + 14 × 32 + 22 × 32). Additionally, they need 31 × 32 or 16 × 32 adders to process the results of the SAUs, depending on whether they consider the high-frequency zeroing adjustment method. Based on this analysis, the work in [29] consumes more adder resources than the proposed architecture, thus leading to a larger area than ours.

The work [34] is the only architecture in the comparison list that supports three types of transform with ASIC technology. It was synthesized at a high frequency of 600 MHz and claimed to be able to process 4K@30 frames/s video sequences in real time. Its Gate Count is far less than that of our design because it relies on a lower sample parallelism, that is, 2 pixels/cycle, as described in their paper. Consequently, its NA reaches 80.7, which is about four times ours. To sum up, our design achieves the best NA and NPC indicators in the comparison for both single transform types and all transform types.

It is worth noting that, in our design, the sum of the NAs of the DST-VII/DCT-VIII architecture and the DCT-II architecture is 24.2, which is larger than the NA of the all-type architecture, 21.6. This means that, compared with independent designs for the three types of transform, such a unified architecture is meaningful and valuable because it further reduces resource consumption even when
lots of control logic circuits are introduced to reconfigure the datapaths.

TABLE IV
COMPARISON WITH OTHER WORKS FOR 1-D TRANSFORM HARDWARE IMPLEMENTATION

TABLE V
SYNTHESIS OF THE PROPOSED 1-D DCT-II ARCHITECTURE AT 100 MHz

B. Synthesis of 2-D Transform

To the best of our knowledge, the proposed work is the first ASIC-based 2-D implementation of MTS that supports all types allowed in VVC. Herein, our work is compared with other related 2-D proposals recently published in the literature. For a fair comparison, we give a short comparison with ASIC-based designs and a detailed comparison with field-programmable gate array (FPGA)-based designs, since most implementations of 2-D VVC transform schemes are based on FPGA and only a few of them are realized with ASIC. Note that the 64-point size is left out of consideration for fairness, as most works in the compared list support up to 32-point. The extension from 32-point to 64-point can be investigated in future work.

Fan et al. [27] and Masera et al. [19] are two significant instances. They presented the most advanced performance for a single transform type among the few ASIC-based implementations. Fan et al. [27] claim to propose a fully pipelined architecture for VVC, but only a DST-VII/DCT-VIII architecture is synthesized. Masera et al. [19] proposed an area-efficient fixed-point architecture for DCT-II and offered a fast algorithm to reduce resource consumption with an acceptable BD-rate loss. Compared to them, the most obvious contribution of our proposal is the full function coverage, which enables three transform types. Besides, our work has a higher throughput, given identical pixel parallelism and a higher operating frequency.

Regarding the NA and NPC indicators, a lower NA is achieved compared to [27] thanks to the better data processing ability. The NPC of our work is larger than that of either work but smaller than their sum. This is reasonable and competitive considering the comprehensive functional coverage and high frequency.

Interestingly, the total area (Gate Count) of [27] and [19] is smaller than ours. Would the advantages of our proposal disappear if these two works were merged into a new design? They would not, and the reason lies in the NA. As is well known, an ASIC can work at any frequency that does not exceed its maximum frequency, so we assume that the maximum frequency will be about 250 MHz when the designs in [27] and [19] are merged with the same technology. The merged Gate Count will become about 749 k. Consequently, the new NA is expected to be 93.6, which is far larger than that of our design. It can also be inferred that the total power of this merged design shows no obvious advantage over ours. Therefore, our unified design still holds a certain advantage in ASIC-based implementation of the VVC transform.

As for FPGA implementations, note that our proposal is mainly designed on ASIC principles and is not specially optimized for FPGA implementation. Nevertheless, for a comprehensive comparison, the results of the FPGA-based case are also given in Table VI.

The processing abilities of the different proposals are compared first. Garrido et al. [24], Kammoun et al. [26], and Zhang and Lu [35] were designed with non-constant pixel parallelism. The single processing element can be reused for different sizes, so the designs are conveniently realized with a small total area, and the throughput is necessarily non-constant. As described in their papers, the implementations in Garrido et al. [24], Kammoun et al. [26], and Zhang and Lu [35] support up to 2K@50 frames/s, 4K@30 frames/s, and 4K@18 frames/s, respectively, on average. Mert et al. [21] and Kammoun et al. [36] achieved constant throughputs of 8 and 32, and the corresponding processing ability is up to 8K@35 frames/s and 4K@96 frames/s. Though Mert
et al. [21] claim to support 8K@35 frames/s video, only four-point and eight-point transforms are supported, which is not suitable for the MTS implementation in VVC. It can be derived that our design is capable of handling 8K@34 frames/s or 4K@135 frames/s after a simple conversion, given that 4 modes need to be processed in the RDO of the encoder, which is a significant advantage.

TABLE VI
COMPARISON WITH OTHER WORKS FOR 2-D TRANSFORM FPGA IMPLEMENTATION

TABLE VII
COMPARISON WITH OTHER WORKS FOR 2-D TRANSFORM ASIC IMPLEMENTATION

Regarding resource consumption, 101 K look-up tables (LUTs) and 47.3 K flip-flops (FFs) are used in our design. Without multipliers, we do not need any digital signal processor (DSP) resources. In the list of related works, the design in [36] has the strongest processing ability; therefore, we mainly compare with it. According to the structure of the adaptive logic modules (ALMs) in the Arria 10 Systems on Chips (SoCs) FPGA device [37] used in [36], each ALM contains six LUTs: two four-input LUTs and four three-input LUTs. A rough conversion for [36] gives about 220 K LUTs consumed. With 101 K versus 220 K, our design is very competitive. Besides, 738 DSP resources are saved. Therefore, it can be concluded that, with stronger processing ability, our design achieves lower LUT and DSP resource consumption than the state-of-the-art scheme. To achieve a reconfigurable architecture for three types, complex control logic circuits are unavoidable, which leads to heavier consumption of FFs in our design than in other works.

V. CONCLUSION

This article proposes a reconfigurable and unified architecture for MTS in VVC that can be configured to support all transform types with square and rectangular sizes ranging from 4 × 4 to 32 × 32. To investigate the unified design, we present a detailed analysis of the features existing in the three types of transform matrices. Based on these features, an improved calculation scheme for general transforms is proposed, in which the decomposition algorithm reduces the complexity of calculation and enhances the similarity of the different types of matrices. Thanks to the decomposition, an area-efficient and unified SAU is designed according to the range of entry values in the transform matrices; it can be highly reused by different sizes and types of transforms. The SAU-based matrix multiplication then contributes substantially to saving resources. It can be applied to DST-VII/DCT-VIII directly, as described in Feature#2, and can also be used in the DCT-II architecture, where the T2S scheme ensures the function of the Butterfly algorithm and provides reusability of adders for different DCT-II sizes. Consequently, we achieve a reconfigurable and unified architecture with apparent advantages in terms of area and power compared to other ASIC- and FPGA-based implementations. Simulation and synthesis results show that the proposed architecture can support real-time UHD formats with 8K@115 frames/s and 4K@135 frames/s for the ASIC-based and FPGA-based implementations, respectively.

REFERENCES

[1] B. Bross et al., “Overview of the versatile video coding (VVC) standard and its applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 10, pp. 3736–3764, Oct. 2021.
[2] C. T. Huang, M. Tikekar, and A. P. Chandrakasan, “Memory-hierarchical and mode-adaptive HEVC intra prediction architecture for quad full HD video decoding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 7, pp. 1515–1525, Jul. 2014.
[3] T. Onishi et al., “A single-chip 4 K 60-fps 4:2:2 HEVC video encoder LSI employing efficient motion estimation and mode decision framework with scalability to 8 K,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 10, pp. 1930–1938, Oct. 2018.
[4] G. He, D. Zhou, Y. Li, Z. Chen, T. Zhang, and S. Goto, “High-throughput power-efficient VLSI architecture of fractional motion estimation for ultra-HD HEVC video encoding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 12, pp. 3138–3142, Dec. 2015.
[5] Q. Shang, Y. Fan, W. Shen, S. Shen, and X. Zeng, “Single-port SRAM-based transpose memory with diagonal data mapping for large size 2-D DCT/IDCT,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 11, pp. 2423–2427, Nov. 2014.
[6] X. Zhao, J. Chen, M. Karczewicz, L. Zhang, X. Li, and W.-J. Chien, “Enhanced multiple transform for video coding,” in Proc. Data Compress. Conf. (DCC), 2016, pp. 73–82.
[7] R. Clarke, “Relation between the Karhunen–Loève and cosine transforms,” IEE Proc. F Commun., Radar Signal Process., vol. 128, no. 1, pp. 359–360, Nov. 1981, Accessed: Oct. 30, 2022. [Online]. Available: https://digital-library.theiet.org/content/journals/10.1049/ip-f-1.1981.0061
[8] J. Han, A. Saxena, V. Melkote, and K. Rose, “Jointly optimized spatial prediction and block transform for video and image coding,” IEEE Trans. Image Process., vol. 21, no. 4, pp. 1874–1884, Apr. 2012.
[9] D.-Y. Kim, “Rate-distortion optimized adaptive transform coding,” Opt. Eng., vol. 48, no. 8, Aug. 2009, Art. no. 087004.
[10] X. Zhao, J. Chen, M. Karczewicz, A. Said, and V. Seregin, “Joint separable and non-separable transforms for next-generation video coding,” IEEE Trans. Image Process., vol. 27, no. 5, pp. 2514–2525, May 2018.
[11] T. Biatek, V. Lorcy, and P. Philippe, “Transform competition for temporal prediction in video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 3, pp. 815–826, Mar. 2019.
[12] Versatile Video Coding, Standard ISO/IEC 23090-3, 2020.
[13] R. Conceição, J. C. De Souza, R. Jeske, B. Zatt, M. Porto, and L. Agostini, “Low-cost and high-throughput hardware design for the HEVC 16 × 16 2-D DCT transform,” J. Integr. Circuits Syst., vol. 9, no. 1, pp. 25–35, Dec. 2014.
[14] H. Sun, Z. Cheng, A. M. Gharehbaghi, S. Kimura, and M. Fujita, “A low-cost approximate 32-point transform architecture,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2017, pp. 1–4.
[15] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, “Efficient integer DCT architectures for HEVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 1, pp. 168–178, Jan. 2014.
[16] S. Yoon Park and P. K. Meher, “Flexible integer DCT architectures for HEVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 1376–1379.
[17] W. Zhao, T. Onoye, and T. Song, “High-performance multiplierless transform architecture for HEVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 1668–1671.
[18] M. Zheng, J. Zheng, Z. Chen, L. Wu, X. Yang, and N. Ling, “A reconfigurable architecture for discrete cosine transform in video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 3, pp. 810–821, Mar. 2020.
[19] M. Masera, G. Masera, and M. Martina, “An area-efficient variable-size fixed-point DCT architecture for HEVC encoding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 1, pp. 232–242, Jan. 2020.
[20] H. Sun, Z. Cheng, A. M. Gharehbaghi, S. Kimura, and M. Fujita, “Approximate DCT design for video encoding based on novel truncation scheme,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 4, pp. 1517–1530, Apr. 2018.
[21] A. C. Mert, E. Kalali, and I. Hamzaoglu, “High performance 2D transform hardware for future video coding,” IEEE Trans. Consum. Electron., vol. 63, no. 2, pp. 117–125, May 2017.
[22] M. J. Garrido, F. Pescador, M. Chavarrías, P. J. Lobo, and C. Sanz, “A high performance FPGA-based architecture for the future video coding adaptive multiple core transform,” IEEE Trans. Consum. Electron., vol. 64, no. 1, pp. 53–60, Feb. 2018.
[23] M. J. Garrido, F. Pescador, M. Chavarrías, P. J. Lobo, C. Sanz, and P. Paz, “An FPGA-based architecture for the versatile video coding multiple transform selection core,” IEEE Access, vol. 8, pp. 81887–81903, 2020.
[24] M. J. Garrido, F. Pescador, M. Chavarrías, P. J. Lobo, and C. Sanz, “A 2-D multiple transform processor for the versatile video coding standard,” IEEE Trans. Consum. Electron., vol. 65, no. 3, pp. 274–283, Aug. 2019.
[25] A. Kammoun, S. Ben Jdidia, F. Belghith, W. Hamidouche, J. F. Nezan, and N. Masmoudi, “An optimized hardware implementation of 4-point adaptive multiple transform design for post-HEVC,” in Proc. 4th Int. Conf. Adv. Technol. Signal Image Process. (ATSIP), Mar. 2018, pp. 1–6.
[26] A. Kammoun, W. Hamidouche, F. Belghith, J.-F. Nezan, and N. Masmoudi, “Hardware design and implementation of adaptive multiple transforms for the versatile video coding standard,” IEEE Trans. Consum. Electron., vol. 64, no. 4, pp. 424–432, Nov. 2018.
[27] Y. Fan, Y. Zeng, H. Sun, J. Katto, and X. Zeng, “A pipelined 2D transform architecture supporting mixed block sizes for the VVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 9, pp. 3289–3295, Sep. 2020.
[28] Z. Zhang et al., “Fast DST-VII/DCT-VIII with dual implementation support for versatile video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 1, pp. 355–371, Jan. 2021.
[29] F. Yibo, K. Jiro, S. Heming, Z. Xiaoyang, and Z. Yixuan, “A minimal adder-oriented 1 D DST-VII/DCT-VIII hardware implementation for VVC standard,” in Proc. 32nd IEEE Int. System Chip Conf. (SOCC), Sep. 2019, pp. 176–180.
[30] Y. Zeng, H. Sun, J. Katto, and Y. Fan, “Approximated reconfigurable transform architecture for VVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
[31] Z. Hao, Q. Zheng, Y. Fan, G. Xiang, P. Zhang, and H. Sun, “An area-efficient unified transform architecture for VVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2022, pp. 2012–2016.
[32] Z. Hao, F. Xu, G. Xiang, P. Zhang, X. Zeng, and Y. Fan, “A multiplierless transform architecture with the diagonal data mapping transpose memory for the AVS 3 standard,” in Proc. IEEE 14th Int. Conf. ASIC (ASICON), Oct. 2021, pp. 1–4.
[33] S. Chatterjee and K. Sarawadekar, “An optimized architecture of HEVC core transform using real-valued DCT coefficients,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 12, pp. 2052–2056, Dec. 2018.
[34] I. Farhat, W. Hamidouche, A. Grill, D. Menard, and O. Deforges, “Lightweight hardware implementation of VVC transform block for ASIC decoder,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 1663–1667.
[35] Y. Zhang and C. Lu, “Efficient algorithm adaptations and fully parallel hardware architecture of H.265/HEVC intra encoder,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 11, pp. 3415–3429, Nov. 2019.
[36] A. Kammoun et al., “Forward-inverse 2D hardware implementation of approximate transform core for the VVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 11, pp. 4340–4354, Nov. 2020.
[37] C. U. Intel and S. Clara. (2017). Intel Arria 10 Device Overview. Accessed: Oct. 30, 2022. [Online]. Available: https://www.altera.com/documentation/sam1403480274650.html
