A Reconfigurable Multiple Transform Selection Architecture For VVC
Fig. 1. Transform matrices of the different types for N = 8. C-II denotes DCT-II, C-VIII denotes DCT-VIII, and S-VII denotes DST-VII.
The rest of this article is organized as follows. Section II presents the improved calculation scheme for general transforms. In Section III, we describe the unified and reconfigurable architecture in detail. Section IV shows the synthesis results, which demonstrate the advantages of the proposed design over existing work in the literature. Section V concludes this article.

II. IMPROVED CALCULATION SCHEME FOR TRANSFORMS

A. Properties of Transforms in VVC

The primary transform types in MTS consist of DCT-II, DST-VII, and DCT-VIII. Their basis functions are shown in Table I. A common method to reduce the computational complexity is to decompose the 2-D transform into two 1-D transforms according to the separability of the transforms. For each W × H residual block, H 1-D horizontal transforms are performed first to obtain the intermediate data. Next, W 1-D vertical transforms are carried out to generate the final transformed coefficient block. The types of the horizontal and vertical transforms are determined by the RDO process and need not be the same. For all transform types, the 1-D transform can be calculated as

Y = T_N × X^T    (1)

where X is a 1 × N input row/column vector, T_N is the N × N transform matrix, and Y is the N × 1 output vector holding the result.

The 2-D transform is conducted by two successive 1-D transforms according to the separability of the trigonometric transforms. This process can be summarized as

Y = T_N × Y_tmp^T = T_N × (T_N × X^T)^T = T_N × X × T_N^T.    (2)

Here, Y_tmp is the intermediate matrix holding the result of the first 1-D transform. Between the two transforms, a transpose memory is necessary to transpose the intermediate matrix efficiently. Thus, a typical 2-D transform hardware design contains three modules: a 1-D row transform module, a 1-D column transform module, and a transpose memory.

The 1-D transform can be formulated as the multiplication of an N × N transform matrix by an N × 1 input vector, so the properties of the transform matrices determine the complexity of the transform operations. Therefore, we offer a brief analysis of the DCT-II, DST-VII, and DCT-VIII transform matrices for N = 8, depicted in Fig. 1, to investigate the features that can be exploited for a unified hardware architecture.

In DST-VII and DCT-VIII, two typical features can be observed; they have been utilized in many works in the literature and are useful for a unified implementation of these two types. They are described in detail below.

1) Mutual Duality of DST-VII and DCT-VIII: Comparing the DST-VII and DCT-VIII matrices, with row indices starting from 0, the even-index rows of the two matrices have the same coefficients but in reverse order, and the odd-index rows have opposite coefficients in reverse order. Specifically, for even-index rows

(a, b, c, d, e, f, g, h) ⇐⇒ (h, g, f, e, d, c, b, a)
DST-VII ⇐⇒ DCT-VIII

and for odd-index rows

(a, b, c, d, e, f, g, h) ⇐⇒ (−h, −g, −f, −e, −d, −c, −b, −a)
DST-VII ⇐⇒ DCT-VIII.

2) Self-Circulation of DST-VII and DCT-VIII: For DST-VII or DCT-VIII, the coefficients in each row/column of the transform matrix are drawn from a single set of coefficients sharing the same absolute values; only the position and sign of each coefficient change among the different rows/columns of the matrix.

These features also exist in the transform matrices of other sizes, so it is convenient to design a reusable architecture for DST-VII and DCT-VIII by exploiting them. With Feature#1, the DST-VII circuit can easily perform the function of DCT-VIII by carrying out two simple operations: 1) reverse the input order and 2) negate the odd-index outputs. On the other hand, Feature#2 enables an efficient implementation of DST-VII/DCT-VIII. Specifically, we can design an SAU circuit to generate all the product terms required in the transform process, as presented in [27] and [29], where the multipliers are replaced by adders and shifts. The SAU-based transform scheme is described in detail in Section III-B.

Before investigating the features of the DCT-II matrix, let us briefly review the well-known fast DCT-II algorithm [15], named the Even-Odd decomposition algorithm, also known as
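To make Feature#1 and the separable 2-D transform of (2) concrete, the following sketch builds floating-point DST-VII and DCT-VIII matrices from their standard trigonometric basis functions (an assumption here; VVC itself uses integerized approximations of these matrices) and checks both properties numerically.

```python
import numpy as np

def dst7(N):
    # Standard DST-VII basis: T[i][j] = sqrt(4/(2N+1)) * sin(pi*(2i+1)*(j+1)/(2N+1))
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.sin(np.pi * (2 * i + 1) * (j + 1) / (2 * N + 1))

def dct8(N):
    # Standard DCT-VIII basis: T[i][j] = sqrt(4/(2N+1)) * cos(pi*(2i+1)*(2j+1)/(4N+2))
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.cos(np.pi * (2 * i + 1) * (2 * j + 1) / (4 * N + 2))

N = 8
S7, C8 = dst7(N), dct8(N)

# Feature#1: DCT-VIII row i equals DST-VII row i in reverse order, negated for odd i.
signs = (-1.0) ** np.arange(N)            # +1 for even rows, -1 for odd rows
assert np.allclose(C8, signs[:, None] * S7[:, ::-1])

# Separable 2-D transform of (2): Y = T_N * X * T_N^T (same type used both ways here).
X = np.random.randn(N, N)                 # stand-in for an N x N residual block
Y = S7 @ X @ S7.T
Y_tmp = S7 @ X.T                          # first 1-D stage applied to the rows of X
assert np.allclose(Y, S7 @ Y_tmp.T)       # second 1-D stage after transposition
print("Feature#1 and (2) verified for N =", N)
```

In hardware terms, this is exactly why a single DST-VII datapath plus an input-reversal and odd-output negation stage suffices to cover DCT-VIII as well.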
TABLE II
RANGE OF ABSOLUTE VALUES IN THE Q_N/Q_N^o AND R_N/R_N^o MATRICES
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on February 02,2024 at 05:25:42 UTC from IEEE Xplore. Restrictions apply.
HAO et al.: RECONFIGURABLE MULTIPLE TRANSFORM SELECTION ARCHITECTURE FOR VVC 663
Fig. 5. Rearrangement with T2C scheme for the inputs of Even-Odd decomposition circuit with different sizes and depths. (a) For N = 32, depth0. (b) For
N = 16, depth0; for N = 32, depth1. (c) For N = 8, depth0; for N = 16, depth1.
Fig. 6. Outputs of the SAU inputs selection module under the configurations of different sizes and types. (a) Cases of DST-VII/DCT-VIII. (b) Cases of
DCT-II.
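As background for the SAU-based datapath referenced in Fig. 6, the sketch below illustrates the basic shift-and-add idea: every product of an input sample with a constant coefficient magnitude is assembled from shifted copies of the input, so no multiplier is needed. The coefficient magnitudes used here are purely illustrative, not the actual VVC matrix entries.

```python
def shift_add_products(x, coeff_mags):
    """Multiplier-less products x*c for a set of constant magnitudes.

    Each magnitude is decomposed into powers of two at design time,
    so at run time only shifts and additions of x are required; this
    is the idea behind a shift-add unit (SAU). The magnitudes below
    are examples, not the actual VVC transform coefficients.
    """
    products = {}
    for c in coeff_mags:
        terms, bit, rest = [], 0, c
        while rest:
            if rest & 1:
                terms.append(x << bit)   # shifted copy of the input sample
            rest >>= 1
            bit += 1
        products[c] = sum(terms)         # adders combine the shifted copies
    return products

# One residual sample multiplied by a few constant magnitudes.
p = shift_add_products(7, [4, 9, 26, 42])
assert p == {4: 28, 9: 63, 26: 182, 42: 294}
```

Because Feature#2 guarantees that every row reuses the same set of magnitudes, one such unit can serve all rows; the selection module of Fig. 6 then routes the appropriately signed product terms to each output.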
and types. In order to make it reconfigurable for the various transform types, two datapaths are available in this module. One is for DST-VII/DCT-VIII, and the other can be configured to perform the Even-Odd decomposition of Step 1 of the Butterfly algorithm for DCT-II. The only difference between DST-VII and DCT-VIII is whether the data are input in forward or reverse order.

To improve resource utilization and achieve constant throughput, we designed a unified interface with 32 residual inputs. For sizes smaller than 32, multiple rows or columns are combined into the input data. Therefore, this reconfigurable architecture can conduct one 32-point, two 16-point, four eight-point, or eight four-point transforms, controlled by the size flag.

In this module, the main calculation effort when conducting the DCT-II transform is the Even-Odd decomposition. As described in (3) and (4), multiple adders and subtractors are necessary for extracting the E_N^O and E_N^E vectors. For eight four-point DCT-II transforms, the first two and last two data of each row are added and subtracted to obtain the corresponding even vector and odd vector. This Even-Odd decomposition process is illustrated in Fig. 4.

To reuse these adders and subtractors when processing different sizes, the problem of data mismatch needs to be handled carefully. For example, for two 16-point transforms, two rows are spliced into a combined 32-point input vector. Delivering this combined vector directly to the Even-Odd decomposition circuit of Fig. 4 would produce errors, because data in the wrong positions are added and subtracted. Specifically, X_0 needs to be added to X_15 to obtain e[0]; in Fig. 4, however, it would be added to X_3, and a wrong e[0] would be generated. Admittedly, designing a separate Even-Odd decomposition circuit for each size would solve this data mismatch problem, but it introduces additional resource consumption. To overcome this shortcoming and improve resource utilization, we propose a rearrangement scheme for all the combined 32-point vectors, depicted in Fig. 5.
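The following sketch illustrates the mismatch and the rearrangement idea. For simplicity it models the fixed first butterfly stage as pairing input i with input 31 − i (the 32-point configuration); the actual wiring of Fig. 4 and the exact T2C mapping of Fig. 5 differ in detail, so the permutation used here is only a hypothetical example of the principle.

```python
import numpy as np

def even_odd_stage32(x):
    """Fixed 32-input first Even-Odd stage: e[i] = x[i] + x[31-i], o[i] = x[i] - x[31-i]."""
    x = np.asarray(x)
    return x[:16] + x[31:15:-1], x[:16] - x[31:15:-1]

def even_odd_16(row):
    """Reference 16-point Even-Odd split: what each 16-point row actually needs."""
    row = np.asarray(row)
    return row[:8] + row[15:7:-1], row[:8] - row[15:7:-1]

# Two illustrative 16-point rows packed into the single 32-wide interface.
A, B = np.arange(16), np.arange(100, 116)

# Naive splicing pairs A_0 with B_15 (a wrong partner) in the fixed stage.
e_bad, _ = even_odd_stage32(np.concatenate([A, B]))
assert e_bad[0] != (A[0] + A[15])            # the data mismatch described above

# A hypothetical rearrangement that makes the fixed pairing i <-> 31-i hit the
# right partners: [A_0..A_7, B_0..B_7, B_8..B_15, A_8..A_15].
rearranged = np.concatenate([A[:8], B[:8], B[8:], A[8:]])
e, o = even_odd_stage32(rearranged)
eA, oA = even_odd_16(A)
eB, oB = even_odd_16(B)
assert np.array_equal(e[:8], eA) and np.array_equal(e[8:], eB)
assert np.array_equal(o[:8], oA) and np.array_equal(o[8:], oB)
```

The outputs then come out grouped per row, so the same fixed adder/subtractor array serves both 16-point transforms; the deeper decomposition depths of Fig. 5 apply the same idea recursively.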
Fig. 8. Diagram of an adder-tree calculator with 16 input data ports and one output data port. The red dotted line marks the critical path, which consists of four adders.

Fig. 9. (a) High-frequency zeroing adjustment method. (b) Reuse scheme for DST-VII.
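A minimal behavioral model of the adder-tree calculator of Fig. 8 is sketched below: 16 inputs are reduced pairwise in log2(16) = 4 stages, which is why the critical path consists of four adders. As discussed below, two such trees can also be combined to form the 32-term sums needed by the 32-point DST-VII/DCT-VIII.

```python
def adder_tree(values):
    """Pairwise reduction of 16 product terms; returns (sum, adder stages on the critical path)."""
    assert len(values) == 16
    stages = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        stages += 1                       # one adder delay per stage
    return values[0], stages

total, depth = adder_tree(list(range(16)))
assert total == sum(range(16)) and depth == 4

# A 32-term coefficient (32-point DST-VII/DCT-VIII) is split across two trees,
# whose outputs are then added by one extra adder (the reuse idea of Fig. 9(b)).
terms = list(range(32))
s0, _ = adder_tree(terms[:16])
s1, _ = adder_tree(terms[16:])
assert s0 + s1 == sum(terms)
```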
of 32 pixels/cycle, the design in [27] needs to prepare 32 four-point SAUs, 32 eight-point SAUs, 32 16-point SAUs, and 32 32-point SAUs. In our design, however, 32 unified SAUs are enough, which is an obvious advantage.

C. Adder-Tree Calculator and Input Selection

In the proposed architecture, there are 64 adder-tree calculators. Half of them are used for summing up product terms during the Q_N matrix multiplications and the other half for R_N, which ensures the high parallelism and pipelining ability of the overall design. Each adder-tree calculator has 16 input data ports and one output port; it sums up the 16 input data and outputs the result. Its simplified diagram is plotted in Fig. 8, from which it can be seen that the critical path is composed of four adders.

The adder-tree input selection module picks out the needed product terms for each adder-tree calculator. After the SAU calculators, all product terms needed during the matrix multiplication process are available. According to the requirements of the different transform types and sizes, this module selects the corresponding data from the outputs of the unified SAUs.

However, one issue needs extra attention. For the 32-point DST-VII transform, a transformed coefficient cannot be obtained directly: 32 product terms are selected from the outputs of different unified SAUs, and at least 31 adders are needed to obtain one coefficient, while one adder-tree calculator, as described above, can only sum up 16 product terms. To save resources and improve reusability, we offer a solution to this problem by taking the high-frequency zeroing adjustment method adopted in VVC into consideration. A brief introduction to this method is given here to clarify the efficiency of the solution.

To align the worst-case number of multiplications per coefficient with HEVC, for the 64-point DCT-II and the 32-point DST-VII/DCT-VIII, only the first 32 and 16 low-frequency coefficients, respectively, are kept in VVC; the high-frequency coefficients are zeroed out. Fig. 9(a) depicts how the high-frequency zeroing adjustment method works for DST-VII with N = 32. Specifically, for each row or column in the 1-D transforms, the first 16 transformed coefficients {Y_0, Y_1, ..., Y_15} are retained, and the last 16 ones {Y_16, Y_17, ..., Y_31} are set to zero. Therefore, the last 16 transformed coefficients need not be calculated, and the 16th to 31st adder-tree calculators can be reused for obtaining the first 16 transformed coefficients. The reuse scheme is shown in Fig. 9(b), where O_i denotes the outputs of the preceding adder-tree input selection module and R_i the corresponding results. Based on the analysis above, the first 16 outputs of the adder-tree input selection module each contain 32 product terms, and the last 16 outputs contain only 16. When conducting the 32-point DST-VII, the first 16 product terms of O_0–O_15 are delivered to adder-tree calculators A_0–A_15, and the last 16 product terms of O_0–O_15 are delivered to A_16–A_31. After the adder-tree calculators, the results of A_i and A_{32−i} are added to obtain the transformed coefficients. Thanks to Feature#1, DCT-VIII can also be handled by this reuse circuit with a simple reversal operation: reversing the input order from O_0–O_31 to O_31–O_0 generates the corresponding low-frequency coefficients in R_0–R_15.

D. Outputs Selection

After summing up the results of Q_N and R_N, the derived transformed coefficients are sent to the Outputs selection module to adjust their order and signs. The proposed architecture designs the default datapath based on DCT-II and DST-VII. Therefore, according to Feature#1, the odd-index coefficients need to be negated when conducting DCT-VIII. Besides, for DCT-II, the order of the coefficients should be adjusted according to the Butterfly algorithm. Taking the 32-point DCT-II transform as an example, the last 16 coefficients of the O vector, as depicted in Fig. 6(b), actually belong at the odd indices from Y_1 to Y_31, so rearranging their order is necessary for correct reading by later modules.

IV. EXPERIMENTAL AND SYNTHESIS RESULTS

In this article, a unified 1-D and 2-D transform architecture is proposed and specified in Verilog HDL, which can be reconfigured to perform all transform types with square and rectangular sizes ranging from 4 × 4 to 32 × 32 allowed in the VVC standard. Three kinds of 1-D designs, distinguished by the supported types, are presented in this article to enable a detailed comparison with existing state-of-the-art works: DCT-II-only, DST-VII/DCT-VIII-only, and all types. A 2-D architecture enabling
all types is enough to illustrate the advantages of the proposed design, because it is extended from the 1-D architecture with the same datapath but a larger bit-width, and it enjoys the same area and power-consumption advantages as the 1-D architecture. To validate the design, we export the residual data derived from intra- or interprediction in the VVC reference software VTM 16.0 as test vectors, send them to the design module, and compare the output results with the corresponding coefficients in VTM 16.0. Abundant video sequences, from Class A (2560 × 1600) to Class E (1280 × 720), with QP = {22, 27, 32, 37}, are tested, and all the results of the test vectors are validated.

We synthesize the proposed architecture with Design Compiler using the TSMC 65 nm cell library, and the maximum working frequency is up to 476 MHz. The transpose memory in the proposed architecture is designed based on the diagonal data mapping scheme, which was also adopted in [3], [27], and [32]. Tables IV–VII present the detailed comparison with the listed works. It is well known that stronger processing ability often requires a larger area and higher power consumption. For a fairer comparison, the normalized area (NA) introduced in [27] and the normalized power consumption (NPC) are adopted and defined as follows:

NA = Gate / (MaxFreq × Pixel Parallelism)    (12)
NPC = Energy Consumption / (MaxFreq × Pixel Parallelism)    (13)

where MaxFreq is the maximum working frequency of the design and is selected as the operating frequency in our experiments, Energy Consumption is the corresponding power dissipation parameter, and Gate is the equivalent number of gates estimated from the total area and the area of a two-input NAND gate. Pixel Parallelism indicates the number of pixels processed in each cycle.

A. Synthesis of 1-D Transform

Table IV gives a detailed comparison of 1-D transform schemes between the proposed architecture and the state-of-the-art works. Most of them support only a single kind of transform type, like [29], [15], [33], and [19]. Complete ASIC implementations of the three transform types are few and remain a key research issue.

From the perspective of DCT-II, Table IV shows that the proposed design provides better performance in terms of resource consumption and power dissipation. The NA is only 9.3 thanks to various methods of circuit reusability, while the NAs of [15], [33], and [19] are 43.8, 10.8, and 29.4, respectively. Compared to them, our design has a smaller NA and achieves area reductions of 78.8%, 14%, and 68.4%.

One of the main concerns nowadays in ASIC projects is power dissipation/energy consumption. However, only [33] offered a power analysis of 1-D transform designs. Note that the power estimation in [33] is done with the clock constrained at 100 MHz instead of its maximum frequency of 256 MHz, so we also provide the result of synthesizing our design at 100 MHz, as presented in Table V. As can be seen from the NPC indicators, the proposed DCT-II architecture still shows a certain advantage with 2.5 versus 5.0, though a more advanced technology library contributes to the results.

When it comes to DST-VII/DCT-VIII, Zhang et al. [28] investigated the inherent features of DST-VII and DCT-VIII and proposed an algorithm of great significance for accelerating the software calculation time. However, it is not suitable for unified hardware architecture designs because the execution process varies with different sizes. What is more, unlike 1-D DCT-II, whose symmetry and anti-symmetry properties can be exploited by the Butterfly algorithm to reduce the computational complexity, a 1-D DST-VII/DCT-VIII architecture generally carries a larger amount of calculation and heavier resource consumption than DCT-II. This can be observed from the Gate Count in Table IV. The main comparison work is [29], which was later expanded into the 2-D architecture [27]. Table IV indicates that the proposed 1-D DST-VII/DCT-VIII architecture reduces the NA by 33.5%. Though a power dissipation analysis is not available in [29], it can be inferred that a smaller area ensures a lower NPC indicator under the same technology library.

A comparison of the theoretical resource consumption is described here and explains the area advantage of the proposed DST-VII/DCT-VIII architecture over [29]. Since addition and shift operations are used to replace multiplications, the theoretical resource comparison is carried out by estimating the number of adders. For a constant throughput of 32 pixels/cycle, our design needs 32 unified SAUs and 64 adder-tree calculators (32 for the Q_N operations and 32 for the R_N operations). The total number of adders is 1056 (32 × 3 + 15 × 64). In [29], SAUs of different sizes cannot be reused because they differ, so 32 32-point SAUs, 32 16-point SAUs, 32 eight-point SAUs, and 32 four-point SAUs are necessary for their design. Therefore, their SAU module contains 1568 adders according to Table III (6 × 32 + 7 × 32 + 14 × 32 + 22 × 32). Additionally, they need 31 × 32 or 16 × 32 adders to deal with the results of the SAUs, depending on whether they consider the high-frequency zeroing adjustment method. Based on these analyses, the work in [29] consumes more adder resources than the proposed architecture, thus leading to a larger area than ours.

The work in [34] is the only architecture in the comparison list that supports all three transform types with ASIC technology. It was synthesized at a high frequency of 600 MHz and claimed to process 4K@30 frames/s video sequences in real time. Its Gate Count is far smaller than that of our design because of its lower sample parallelism, that is, 2 pixels/cycle, as described in their paper. Consequently, its NA reaches up to 80.7, which is about four times ours.

To sum up, our design achieves the best NA and NPC indicators in the comparisons of both single transform types and all transform types, showing obvious advantages.

It is worth noting that, in our design, the sum of the NAs of the DST-VII/DCT-VIII architecture and the DCT-II architecture is 24.2, which is larger than the NA of the all-type architecture, 21.6. This means that, compared with independent designs of the three transform types, such a unified architecture is meaningful and valuable because it further reduces resource consumption even when
TABLE IV
COMPARISON WITH OTHER WORKS FOR 1-D TRANSFORM HARDWARE IMPLEMENTATION
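As a quick illustration of how the normalized metrics of (12) and (13) and the adder-count estimates quoted above are obtained, the snippet below reproduces the arithmetic; the gate count and power values used for NA/NPC are placeholders, not the actual Table IV/V entries.

```python
def na(gates, max_freq_mhz, pixel_parallelism):
    # Normalized area, (12): gates per (MHz * pixels/cycle)
    return gates / (max_freq_mhz * pixel_parallelism)

def npc(power_mw, max_freq_mhz, pixel_parallelism):
    # Normalized power consumption, (13)
    return power_mw / (max_freq_mhz * pixel_parallelism)

# Placeholder example: a hypothetical 140 000-gate design at 476 MHz, 32 pixels/cycle.
print(round(na(140e3, 476, 32), 2), round(npc(30.0, 476, 32), 4))

# Adder-count estimates quoted in the text for a 32 pixels/cycle throughput.
ours = 32 * 3 + 15 * 64                  # 3 adders per unified SAU, 15 adders per 16-input tree
ref29_saus = (6 + 7 + 14 + 22) * 32      # size-specific SAUs of [29], per Table III
assert ours == 1056 and ref29_saus == 1568
```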
TABLE VI
COMPARISON WITH OTHER WORKS FOR 2-D TRANSFORM FPGA IMPLEMENTATION

TABLE VII
COMPARISON WITH OTHER WORKS FOR 2-D TRANSFORM ASIC IMPLEMENTATION

et al. [21] claims to support 8K@35 frames/s video, only four-point and eight-point transforms are supported, which is not suitable for the MTS implementation in VVC. It can be derived that our design is capable of handling 8K@34 frames/s or 4K@135 frames/s after a simple conversion, given that four modes need to be processed in the RDO of the encoder, which is a significant advantage.

Regarding resource consumption, 101 K look-up tables (LUTs) and 47.3 K flip-flops (FFs) are used in our design. Without multipliers, we do not need to call on any digital signal processor (DSP) resources. In the list of related works, the design in [36] has the strongest processing ability; therefore, we mainly compare with it. According to the structure of the adaptive logic modules (ALMs) in the Arria 10 System-on-Chip (SoC) FPGA device [37] used in [36], each ALM contains six LUTs: two four-input LUTs and four three-input LUTs. A rough conversion can be made for [36] that about 220 K LUTs would be consumed. In light of that, 101 K versus 220 K, our design is very competitive. Besides, 738 DSP resources are saved. Therefore, it can be concluded that, with stronger processing ability, our design achieves lower LUT and DSP resource consumption than the state-of-the-art scheme. To achieve a reconfigurable architecture for the three types, complex control logic circuits are unavoidable, which leads to heavier consumption of FFs in our design compared to other works.

V. CONCLUSION

This article proposes a reconfigurable and unified architecture for MTS in VVC that can be configured to support all transform types with square and rectangular sizes ranging from 4 × 4 to 32 × 32. To investigate the unified design, we present a detailed analysis of the features existing in the three types of transform matrices. Based on these features, an improved calculation scheme for general transforms is proposed, where the decomposition algorithm reduces the complexity of the calculation and enhances the similarity of the different types of matrices. Thanks to the decomposition, an area-efficient and unified SAU is designed according to the range of entry values in the transform matrices, which can be highly reused across different sizes and types of transforms. The SAU-based matrix multiplication then contributes greatly to saving resource consumption. It can be applied to DST-VII/DCT-VIII directly, as Feature#2 describes, and can also be utilized in the DCT-II architecture, where the T2P scheme preserves the function of the Butterfly algorithm and provides reusability of the adders for different DCT-II sizes. Consequently, we achieve a reconfigurable and unified architecture with apparent advantages in terms of area and power compared to other ASIC- and FPGA-based implementations. Simulation and synthesis results show that the proposed architecture can support real-time UHD formats with 8K@115 frames/s and 4K@135 frames/s for the ASIC-based and FPGA-based implementations, respectively.

REFERENCES

[1] B. Bross et al., “Overview of the versatile video coding (VVC) standard and its applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 10, pp. 3736–3764, Oct. 2021.
[2] C. T. Huang, M. Tikekar, and A. P. Chandrakasan, “Memory-hierarchical and mode-adaptive HEVC intra prediction architecture for quad full HD video decoding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 7, pp. 1515–1525, Jul. 2014.
[3] T. Onishi et al., “A single-chip 4 K 60-fps 4:2:2 HEVC video encoder LSI employing efficient motion estimation and mode decision framework with scalability to 8 K,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 10, pp. 1930–1938, Oct. 2018.
[4] G. He, D. Zhou, Y. Li, Z. Chen, T. Zhang, and S. Goto, “High-throughput power-efficient VLSI architecture of fractional motion estimation for ultra-HD HEVC video encoding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 12, pp. 3138–3142, Dec. 2015.
[5] Q. Shang, Y. Fan, W. Shen, S. Shen, and X. Zeng, “Single-port SRAM-based transpose memory with diagonal data mapping for large size 2-D DCT/IDCT,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 11, pp. 2423–2427, Nov. 2014.
[6] X. Zhao, J. Chen, M. Karczewicz, L. Zhang, X. Li, and W.-J. Chien, “Enhanced multiple transform for video coding,” in Proc. Data Compress. Conf. (DCC), 2016, pp. 73–82.
[7] R. Clarke, “Relation between the Karhunen–Loève and cosine transforms,” IEE Proc. F Commun., Radar Signal Process., vol. 128, no. 1, pp. 359–360, Nov. 1981, Accessed: Oct. 30, 2022. [Online]. Available: https://digital-library.theiet.org/content/journals/10.1049/ip-f-1.1981.0061
[8] J. Han, A. Saxena, V. Melkote, and K. Rose, “Jointly optimized spatial prediction and block transform for video and image coding,” IEEE Trans. Image Process., vol. 21, no. 4, pp. 1874–1884, Apr. 2012.
[9] D.-Y. Kim, “Rate-distortion optimized adaptive transform coding,” Opt. Eng., vol. 48, no. 8, Aug. 2009, Art. no. 087004.
[10] X. Zhao, J. Chen, M. Karczewicz, A. Said, and V. Seregin, “Joint separable and non-separable transforms for next-generation video coding,” IEEE Trans. Image Process., vol. 27, no. 5, pp. 2514–2525, May 2018.
[11] T. Biatek, V. Lorcy, and P. Philippe, “Transform competition for temporal prediction in video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 3, pp. 815–826, Mar. 2019.
[12] Versatile Video Coding, Standard ISO/IEC 23090-3, 2020.
[13] R. Conceição, J. C. De Souza, R. Jeske, B. Zatt, M. Porto, and L. Agostini, “Low-cost and high-throughput hardware design for the HEVC 16 × 16 2-D DCT transform,” J. Integr. Circuits Syst., vol. 9, no. 1, pp. 25–35, Dec. 2014.
[14] H. Sun, Z. Cheng, A. M. Gharehbaghi, S. Kimura, and M. Fujita, “A low-cost approximate 32-point transform architecture,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2017, pp. 1–4.
[15] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, “Efficient integer DCT architectures for HEVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 1, pp. 168–178, Jan. 2014.
[16] S. Yoon Park and P. K. Meher, “Flexible integer DCT architectures for HEVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 1376–1379.
[17] W. Zhao, T. Onoye, and T. Song, “High-performance multiplierless transform architecture for HEVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 1668–1671.
[18] M. Zheng, J. Zheng, Z. Chen, L. Wu, X. Yang, and N. Ling, “A reconfigurable architecture for discrete cosine transform in video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 3, pp. 810–821, Mar. 2020.
[19] M. Masera, G. Masera, and M. Martina, “An area-efficient variable-size fixed-point DCT architecture for HEVC encoding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 1, pp. 232–242, Jan. 2020.
[20] H. Sun, Z. Cheng, A. M. Gharehbaghi, S. Kimura, and M. Fujita, “Approximate DCT design for video encoding based on novel truncation scheme,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 4, pp. 1517–1530, Apr. 2018.
[21] A. C. Mert, E. Kalali, and I. Hamzaoglu, “High performance 2D transform hardware for future video coding,” IEEE Trans. Consum. Electron., vol. 63, no. 2, pp. 117–125, May 2017.
[22] M. J. Garrido, F. Pescador, M. Chavarrías, P. J. Lobo, and C. Sanz, “A high performance FPGA-based architecture for the future video coding adaptive multiple core transform,” IEEE Trans. Consum. Electron., vol. 64, no. 1, pp. 53–60, Feb. 2018.
[23] M. J. Garrido, F. Pescador, M. Chavarrías, P. J. Lobo, C. Sanz, and P. Paz, “An FPGA-based architecture for the versatile video coding multiple transform selection core,” IEEE Access, vol. 8, pp. 81887–81903, 2020.
[24] M. J. Garrido, F. Pescador, M. Chavarrías, P. J. Lobo, and C. Sanz, “A 2-D multiple transform processor for the versatile video coding standard,” IEEE Trans. Consum. Electron., vol. 65, no. 3, pp. 274–283, Aug. 2019.
[25] A. Kammoun, S. Ben Jdidia, F. Belghith, W. Hamidouche, J. F. Nezan, and N. Masmoudi, “An optimized hardware implementation of 4-point adaptive multiple transform design for post-HEVC,” in Proc. 4th Int. Conf. Adv. Technol. Signal Image Process. (ATSIP), Mar. 2018, pp. 1–6.
[26] A. Kammoun, W. Hamidouche, F. Belghith, J.-F. Nezan, and N. Masmoudi, “Hardware design and implementation of adaptive multiple transforms for the versatile video coding standard,” IEEE Trans. Consum. Electron., vol. 64, no. 4, pp. 424–432, Nov. 2018.
[27] Y. Fan, Y. Zeng, H. Sun, J. Katto, and X. Zeng, “A pipelined 2D transform architecture supporting mixed block sizes for the VVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 9, pp. 3289–3295, Sep. 2020.
[28] Z. Zhang et al., “Fast DST-VII/DCT-VIII with dual implementation support for versatile video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 1, pp. 355–371, Jan. 2021.
[29] F. Yibo, K. Jiro, S. Heming, Z. Xiaoyang, and Z. Yixuan, “A minimal adder-oriented 1D DST-VII/DCT-VIII hardware implementation for VVC standard,” in Proc. 32nd IEEE Int. System Chip Conf. (SOCC), Sep. 2019, pp. 176–180.
[30] Y. Zeng, H. Sun, J. Katto, and Y. Fan, “Approximated reconfigurable transform architecture for VVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
[31] Z. Hao, Q. Zheng, Y. Fan, G. Xiang, P. Zhang, and H. Sun, “An area-efficient unified transform architecture for VVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2022, pp. 2012–2016.
[32] Z. Hao, F. Xu, G. Xiang, P. Zhang, X. Zeng, and Y. Fan, “A multiplier-less transform architecture with the diagonal data mapping transpose memory for the AVS 3 standard,” in Proc. IEEE 14th Int. Conf. ASIC (ASICON), Oct. 2021, pp. 1–4.
[33] S. Chatterjee and K. Sarawadekar, “An optimized architecture of HEVC core transform using real-valued DCT coefficients,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 12, pp. 2052–2056, Dec. 2018.
[34] I. Farhat, W. Hamidouche, A. Grill, D. Menard, and O. Deforges, “Lightweight hardware implementation of VVC transform block for ASIC decoder,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 1663–1667.
[35] Y. Zhang and C. Lu, “Efficient algorithm adaptations and fully parallel hardware architecture of H.265/HEVC intra encoder,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 11, pp. 3415–3429, Nov. 2019.
[36] A. Kammoun et al., “Forward-inverse 2D hardware implementation of approximate transform core for the VVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 11, pp. 4340–4354, Nov. 2020.
[37] Intel Corp., Santa Clara, CA, USA. (2017). Intel Arria 10 Device Overview. Accessed: Oct. 30, 2022. [Online]. Available: https://www.altera.com/documentation/sam1403480274650.html