A Pipelined 8x8 2-D Forward DCT Hardware Architecture for H.264/AVC High Profile Encoder

Thaísa Leal da Silva¹, Cláudio Machado Diniz¹, João Alberto Vortmann²,
Luciano Volcan Agostini², Altamiro Amadeu Susin¹, and Sergio Bampi¹

¹ UFRGS – Federal University of Rio Grande do Sul - Microelectronics Group
Porto Alegre - RS, Brazil
{tlsilva, cmdiniz, bampi}@inf.ufrgs.br, [email protected]
² UFPel – Federal University of Pelotas – Group of Architectures and Integrated Circuits
Pelotas - RS, Brazil
{agostini, jvortmann}@ufpel.edu.br

Abstract. This paper presents the hardware design of an 8x8 two-dimensional
Forward Discrete Cosine Transform used in the High profiles of the H.264/AVC
video coding standard. The designed DCT is computed in a separable way as two
1-D transforms. It uses only add and shift operations, avoiding multiplications.
The architecture contains one datapath for each 1-D DCT with a transpose
buffer between them. The complete architecture was synthesized to Xilinx
Virtex-II Pro and Altera Stratix II FPGAs and to TSMC 0.35μm standard-cell
technology. The synthesis results show that the 2-D DCT architecture reaches
the throughput necessary to encode high definition videos in real time for all
target technologies.

Keywords: Video compression, 8x8 2-D DCT, H.264/AVC standard, Architectural Design.

1 Introduction
H.264/AVC (MPEG-4 Part 10) [1] is the latest video coding standard developed by
the Joint Video Team (JVT), which is formed by the cooperation between the ITU Video
Coding Experts Group (VCEG) and the ISO/IEC Moving Pictures Experts Group
(MPEG). This standard achieves significant improvements over the previous
standards in terms of compression rates [1].
The H.264/AVC standard was first organized in three profiles: Baseline, Extended
and Main. A profile defines a set of coding tools or algorithms which can be used to
generate a video bitstream [2]. Each profile is targeted to specific classes of video
applications. The first version of the H.264/AVC standard was focused on
"entertainment-quality" video. In July 2004, an extension was added to this standard,
called the Fidelity Range Extensions (FRExt). This extension focused on professional
applications and high definition videos [3]. Then, a new set of profiles was defined
and this set was generically called High profile, which is the focus of this work. There
are four different profiles in the High profile set, all targeting high quality videos:
High profile (HP) supports video with 8 bits per sample and a YCbCr color relation
of 4:2:0. High 10 profile (Hi10P) supports videos with 10 bits per sample and also
with a 4:2:0 color relation. High 4:2:2 profile (H422P) supports a 4:2:2 color relation
and videos with 10 bits per sample. Finally, High 4:4:4 profile (H444P) supports a
4:4:4 color relation (without color subsampling) and videos with 12 bits per sample.

D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 5 – 15, 2007.
© Springer-Verlag Berlin Heidelberg 2007

One improvement present in the High profiles is the inclusion of an 8x8 integer
transform in the forward transform module. This transform is an integer
approximation of the 8x8 2-D Discrete Cosine Transform (DCT) and it is commonly
referred to as the 8x8 2-D DCT in this standard [3].
This new transform is used to code luminance residues in some specific situations.
Other profiles support only the 4x4 DCT transform. However, significant compression
performance gains were reported for Standard Definition (SD) and High Definition
(HD) solutions when transforms larger than 4x4 are used [4]. So, in the High profiles,
the encoder can choose adaptively between the 4x4 and 8x8 transforms when the input
data was not intra or inter predicted using sub-partitions smaller than 8x8 samples
[5][6].
Fig. 1 presents a block diagram of the H.264 encoder. The main blocks of the
encoder [7], as shown in Fig. 1, are: motion estimation (ME), motion compensation
(MC), intra prediction, forward and inverse transforms (T and T-1), forward and
inverse quantization (Q and Q-1), entropy coding and de-blocking filter.
This work focuses on the design of an 8x8 2-D forward DCT hardware
architecture, which composes the T block of H.264/AVC coders when the High
profile is considered. The T module is highlighted in Fig. 1. This architecture was
designed without multiplications, using only shift and add operations, aiming to
reduce the hardware complexity. Besides, the main goal of the designed architecture
was to reach the throughput needed to process HDTV 1080 frames (1920x1080 pixels)
in real time, allowing its use in H.264/AVC encoders targeting HDTV. This architecture
was synthesized to Altera and Xilinx FPGAs and to TSMC 0.35µm standard cells, and
the synthesis results indicated that the 2-D DCT designed in this work reaches a very
high throughput, making possible its use in a complete video coder for high resolution
videos. We did not find any solution in the literature which presents an H.264/AVC
8x8 2-D DCT completely designed in hardware.

[Figure 1: block diagram with the blocks Current Frame, INTER Prediction (ME, MC,
Reference Frame), INTRA Prediction, T, Q, Entropy Coder, Q-1, T-1, Filter and
Current Frame (reconstructed)]

Fig. 1. Block diagram of a H.264/AVC encoder

This paper is organized as follows: section two presents a review of the 8x8 2-D
forward DCT transform algorithm. The third section presents the designed
architecture. Section four presents the validation strategy. The results of this work and
the discussions about these results are presented in section five. Section six presents
comparisons of this work with related works. Finally, section seven presents the
conclusions of this work.

2 8x8 2-D Forward DCT Algorithm

The 8x8 2-D forward DCT is computed in a separable way as two 1-D transforms: a
1-D horizontal transform (row-wise) and a 1-D vertical transform (column-wise).
The 2-D DCT calculation is achieved through the multiplication of three matrices as
shown in Equation (1), where X is the input matrix, Y is the transformed matrix, Cf is
the transformation matrix and CfT is the transpose of the transformation matrix. The
transformation matrix Cf is shown in Equation (2) [5][8].

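$$
Y = C_f \, X \, C_f^{T} \tag{1}
$$

$$
C_f =
\begin{bmatrix}
 8 &   8 &   8 &   8 &   8 &   8 &   8 &   8 \\
12 &  10 &   6 &   3 &  -3 &  -6 & -10 & -12 \\
 8 &   4 &  -4 &  -8 &  -8 &  -4 &   4 &   8 \\
10 &  -3 & -12 &  -6 &   6 &  12 &   3 & -10 \\
 8 &  -8 &  -8 &   8 &   8 &  -8 &  -8 &   8 \\
 6 & -12 &   3 &  10 & -10 &  -3 &  12 &  -6 \\
 4 &  -8 &   8 &  -4 &  -4 &   8 &  -8 &   4 \\
 3 &  -6 &  10 & -12 &  12 & -10 &   6 &  -3
\end{bmatrix}
\cdot \frac{1}{8} \tag{2}
$$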
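As a plain software illustration of Equation (1), the sketch below evaluates the matrix product directly with exact rational arithmetic. It is only a reference model for checking results, not the hardware datapath; the names `CF8`, `matmul`, `transpose` and `dct2d_matrix` are chosen here for illustration and do not come from the paper.

```python
from fractions import Fraction

# Integer numerators of C_f from Equation (2); the real matrix is CF8 / 8.
CF8 = [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]


def matmul(a, b):
    """Matrix product of two nested lists with compatible shapes."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Transpose of a nested-list matrix."""
    return [list(row) for row in zip(*m)]


def dct2d_matrix(x):
    """Y = C_f X C_f^T evaluated exactly with rational coefficients."""
    cf = [[Fraction(v, 8) for v in row] for row in CF8]
    return matmul(matmul(cf, x), transpose(cf))


# A flat block has all of its energy in the DC coefficient Y[0][0].
flat = [[1] * 8 for _ in range(8)]
y = dct2d_matrix(flat)
print(y[0][0])  # 64
```

Because the first row of Cf is all ones and every other row sums to zero, a constant block maps to a single non-zero coefficient, which makes it a convenient sanity check.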

This transform can be calculated through fast butterfly operations according to
the algorithm presented in Table 1 [5], where in denotes the vector of input values,
out denotes the transformed output vector and a and b are internal variables.

Table 1. 2-D Forward 8x8 DCT Algorithm

Step 1                  Step 2                                      Step 3
a[0] = in[0] + in[7];   b[0] = a[0] + a[3];                         out[0] = b[0] + b[1];
a[1] = in[1] + in[6];   b[1] = a[1] + a[2];                         out[1] = b[4] + (b[7]>>2);
a[2] = in[2] + in[5];   b[2] = a[0] - a[3];                         out[2] = b[2] + (b[3]>>1);
a[3] = in[3] + in[4];   b[3] = a[1] - a[2];                         out[3] = b[5] + (b[6]>>2);
a[4] = in[0] - in[7];   b[4] = a[5] + a[6] + ((a[4]>>1) + a[4]);    out[4] = b[0] - b[1];
a[5] = in[1] - in[6];   b[5] = a[4] - a[7] - ((a[6]>>1) + a[6]);    out[5] = b[6] - (b[5]>>2);
a[6] = in[2] - in[5];   b[6] = a[4] + a[7] - ((a[5]>>1) + a[5]);    out[6] = (b[2]>>1) - b[3];
a[7] = in[3] - in[4];   b[7] = a[5] - a[6] + ((a[7]>>1) + a[7]);    out[7] = - b[7] + (b[4]>>2);
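The three steps of Table 1 translate almost line by line into software. The sketch below is a Python transcription, assuming that `>>` is an arithmetic right shift on signed values (which is how Python treats negative integers). The function names are illustrative, and `dct1d_matrix` is only a reference built from the numerators of Equation (2).

```python
# Integer numerators of C_f from Equation (2); the real coefficients are CF8[r][k] / 8.
CF8 = [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]


def dct1d_3step(inp):
    """1-D 8-point forward transform following the three steps of Table 1."""
    # Step 1: sums and differences of mirrored input pairs.
    a = ([inp[i] + inp[7 - i] for i in range(4)]
         + [inp[i] - inp[7 - i] for i in range(4)])
    # Step 2: even part (b[0..3]) and odd part (b[4..7]).
    b = [
        a[0] + a[3],
        a[1] + a[2],
        a[0] - a[3],
        a[1] - a[2],
        a[5] + a[6] + ((a[4] >> 1) + a[4]),
        a[4] - a[7] - ((a[6] >> 1) + a[6]),
        a[4] + a[7] - ((a[5] >> 1) + a[5]),
        a[5] - a[6] + ((a[7] >> 1) + a[7]),
    ]
    # Step 3: final butterflies.
    return [
        b[0] + b[1],
        b[4] + (b[7] >> 2),
        b[2] + (b[3] >> 1),
        b[5] + (b[6] >> 2),
        b[0] - b[1],
        b[6] - (b[5] >> 2),
        (b[2] >> 1) - b[3],
        -b[7] + (b[4] >> 2),
    ]


def dct1d_matrix(inp):
    """Reference matrix-vector product; exact when every sum is divisible by 8."""
    return [sum(c * v for c, v in zip(row, inp)) // 8 for row in CF8]


impulse = [8, 0, 0, 0, 0, 0, 0, 0]
print(dct1d_3step(impulse))  # [8, 12, 8, 10, 8, 6, 4, 3], the first column of CF8
```

An impulse of amplitude 8 keeps every shift exact, so the butterfly output reproduces a column of the numerator matrix. For arbitrary inputs the shifts truncate, and the butterfly result can differ slightly from a rounded matrix product.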

This algorithm was derived from Equation (1) and it needs three steps to
compute the 1-D DCT transform. However, in this work the algorithm presented in
[5] was modified in order to reduce the critical path of the designed architecture and
to allow a better balanced pipeline. The modified algorithm was divided into five
steps, allowing the architecture to be designed as a five-stage pipeline.
The modified algorithm is presented in Table 2 and it computes the 1-D DCT
transform in five steps. This algorithm uses only one addition or subtraction to
generate each result, allowing a better balance between the calculation stages.

Table 2. 2-D Forward 8x8 DCT Modified Algorithm

Step 1                  Step 2                  Step 3
a[0] = in[0] + in[7];   b[0] = a[0] + a[3];     c[0] = b[0];
a[1] = in[1] + in[6];   b[1] = a[1] + a[2];     c[1] = b[1];
a[2] = in[2] + in[5];   b[2] = a[0] - a[3];     c[2] = b[2];
a[3] = in[3] + in[4];   b[3] = a[1] - a[2];     c[3] = b[3];
a[4] = in[0] - in[7];   b[4] = a[5] + a[6];     c[4] = b[4] + (b[8]>>1);
a[5] = in[1] - in[6];   b[5] = a[4] - a[7];     c[5] = b[5] - (b[10]>>1);
a[6] = in[2] - in[5];   b[6] = a[4] + a[7];     c[6] = b[6] - (b[9]>>1);
a[7] = in[3] - in[4];   b[7] = a[5] - a[6];     c[7] = b[7] + (b[11]>>1);
                        b[8] = a[4];            c[8] = b[8];
                        b[9] = a[5];            c[9] = b[9];
                        b[10] = a[6];           c[10] = b[10];
                        b[11] = a[7];           c[11] = b[11];

Step 4                  Step 5
d[0] = c[0];            out[0] = d[0] + d[1];
d[1] = c[1];            out[1] = d[4] + (d[7]>>2);
d[2] = c[2];            out[2] = d[2] + (d[3]>>1);
d[3] = c[3];            out[3] = d[5] + (d[6]>>2);
d[4] = c[4] + c[8];     out[4] = d[0] - d[1];
d[5] = c[5] - c[10];    out[5] = d[6] - (d[5]>>2);
d[6] = c[6] - c[9];     out[6] = (d[2]>>1) - d[3];
d[7] = c[7] + c[11];    out[7] = (d[4]>>2) - d[7];
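Since Table 2 only regroups the additions of Table 1, the two versions should agree bit for bit in integer arithmetic. The sketch below transcribes both tables and compares them on a range of signed test vectors. It is an illustrative software check with names chosen here, assuming arithmetic right shifts, and it is not the validation flow used by the authors.

```python
def dct1d_3step(inp):
    """Original three-step algorithm (Table 1)."""
    a = ([inp[i] + inp[7 - i] for i in range(4)]
         + [inp[i] - inp[7 - i] for i in range(4)])
    b = [a[0] + a[3], a[1] + a[2], a[0] - a[3], a[1] - a[2],
         a[5] + a[6] + ((a[4] >> 1) + a[4]),
         a[4] - a[7] - ((a[6] >> 1) + a[6]),
         a[4] + a[7] - ((a[5] >> 1) + a[5]),
         a[5] - a[6] + ((a[7] >> 1) + a[7])]
    return [b[0] + b[1], b[4] + (b[7] >> 2), b[2] + (b[3] >> 1),
            b[5] + (b[6] >> 2), b[0] - b[1], b[6] - (b[5] >> 2),
            (b[2] >> 1) - b[3], -b[7] + (b[4] >> 2)]


def dct1d_5step(inp):
    """Modified five-step algorithm (Table 2): one add or subtract per result."""
    # Step 1
    a = ([inp[i] + inp[7 - i] for i in range(4)]
         + [inp[i] - inp[7 - i] for i in range(4)])
    # Step 2: b[8..11] carry a[4..7] forward for later stages.
    b = [a[0] + a[3], a[1] + a[2], a[0] - a[3], a[1] - a[2],
         a[5] + a[6], a[4] - a[7], a[4] + a[7], a[5] - a[6],
         a[4], a[5], a[6], a[7]]
    # Step 3: add or subtract the halved odd terms.
    c = [b[0], b[1], b[2], b[3],
         b[4] + (b[8] >> 1), b[5] - (b[10] >> 1),
         b[6] - (b[9] >> 1), b[7] + (b[11] >> 1),
         b[8], b[9], b[10], b[11]]
    # Step 4: add or subtract the full odd terms.
    d = [c[0], c[1], c[2], c[3],
         c[4] + c[8], c[5] - c[10], c[6] - c[9], c[7] + c[11]]
    # Step 5: final butterflies.
    return [d[0] + d[1], d[4] + (d[7] >> 2), d[2] + (d[3] >> 1),
            d[5] + (d[6] >> 2), d[0] - d[1], d[6] - (d[5] >> 2),
            (d[2] >> 1) - d[3], (d[4] >> 2) - d[7]]


# Deterministic signed test vectors in the range [-255, 255].
vectors = [[((s * 37 + i * 91) % 511) - 255 for i in range(8)] for s in range(200)]
mismatches = [v for v in vectors if dct1d_3step(v) != dct1d_5step(v)]
print(len(mismatches))  # 0
```

The agreement follows from integer addition being associative: each Step 2 expression of Table 1 is split across Steps 2, 3 and 4 of Table 2 without changing the operands of any shift.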

3 Designed Architecture
Based on the modified algorithm presented in Section 2, a hardware architecture for
the 8x8 2-D Forward DCT transform was designed. The architecture uses the 2-D
DCT separability property [9], where the 2-D DCT transform is computed as two 1-D
DCT transforms, one row-wise and the other column-wise. The transposition is made
by a transpose buffer. The 2-D DCT block diagram is shown in Fig. 2.
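The separability property can be illustrated in software as a row pass, a transposition and a second row pass. The sketch below is a behavioral model only, assuming arithmetic right shifts; `dct2d_reference` evaluates Equation (1) with the integer numerators of Cf, and the final transposition back to natural order is a choice made here for easy comparison, not something the paper specifies.

```python
# Integer numerators of C_f from Equation (2); the real matrix is CF8 / 8.
CF8 = [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]


def dct1d(inp):
    """1-D butterfly of Table 1 (shift-and-add only)."""
    a = ([inp[i] + inp[7 - i] for i in range(4)]
         + [inp[i] - inp[7 - i] for i in range(4)])
    b = [a[0] + a[3], a[1] + a[2], a[0] - a[3], a[1] - a[2],
         a[5] + a[6] + ((a[4] >> 1) + a[4]),
         a[4] - a[7] - ((a[6] >> 1) + a[6]),
         a[4] + a[7] - ((a[5] >> 1) + a[5]),
         a[5] - a[6] + ((a[7] >> 1) + a[7])]
    return [b[0] + b[1], b[4] + (b[7] >> 2), b[2] + (b[3] >> 1),
            b[5] + (b[6] >> 2), b[0] - b[1], b[6] - (b[5] >> 2),
            (b[2] >> 1) - b[3], -b[7] + (b[4] >> 2)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2d_separable(x):
    """First 1-D pass on rows, transposition, second 1-D pass, transpose back."""
    first = [dct1d(row) for row in x]              # X * Cf^T
    second = [dct1d(row) for row in transpose(first)]  # equals Y^T
    return transpose(second)


def dct2d_reference(x):
    """Y = Cf X Cf^T with integer numerators; exact when x holds multiples of 64."""
    return [[sum(CF8[i][k] * x[k][l] * CF8[j][l]
                 for k in range(8) for l in range(8)) // 64
             for j in range(8)] for i in range(8)]


# Multiples of 64 keep every shift in both passes exact.
x = [[64 * (((r * 8 + c) % 5) - 2) for c in range(8)] for r in range(8)]
print(dct2d_separable(x) == dct2d_reference(x))  # True
```

The test block uses multiples of 64 so that no shift discards bits in either pass. For general inputs the shift-based result follows the integer algorithm of Table 1 rather than the exact rational product.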
The architecture was designed to consume and produce one sample per clock
cycle. This decision was made to allow an easy integration with the other
transforms designed in our research group for the T module of the H.264/AVC main
profile [10], which were designed with the same production and consumption rates.

Fig. 2. 2-D DCT Block Diagram

The two 1-D DCT modules are similar; the difference is the number of bits used in
each pipeline stage and, consequently, the number of bits used to represent each
sample. This occurs because each addition operation can generate a carry out, so
the number of bits needed to represent the data increases by one bit at each addition.
Both 1-D DCT modules were designed in the same way; only the input and output
bit-widths are changed for the second 1-D DCT module.
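The bit-width growth described above can be tabulated with a few lines of code. This is a minimal sketch, assuming one extra bit per pipeline stage and five stages per 1-D DCT; `stage_widths` is a name chosen here for illustration.

```python
def stage_widths(input_bits, stages=5):
    """Bit width after each pipeline stage, assuming one extra bit per addition stage."""
    return [input_bits + s for s in range(1, stages + 1)]


first = stage_widths(8)            # first 1-D DCT, 8-bit input samples
second = stage_widths(first[-1])   # second 1-D DCT takes the first module's output width
print(first)   # [9, 10, 11, 12, 13]
print(second)  # [14, 15, 16, 17, 18]
```

Under this rule the first module grows from 8 to 13 bits and the second from 13 to 18 bits.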
The control of this architecture was hierarchically designed and each sub-module
has its own control. A simple global control is used to start the sub-modules operation
in a synchronous way.
The designed architecture for the 8x8 1-D DCT transform is shown in Fig. 3. The
hardware architecture implements the modified algorithm presented in Section 2. It
has a five stage pipeline and it uses ping-pong buffers, adders/subtractors and
multiplexers. This architecture uses only one operator in each pipeline stage as shown
in Fig. 3.
Ping-pong buffers are two register lines (ping and pong), each register with n bits.
The data enter the ping buffer serially, one sample at each clock cycle. When n
samples are ready in the ping buffer, they are transferred to the pong buffer in
parallel [11]. There are five ping-pong buffers in the architecture and these registers
are necessary to allow the pipeline synchronization.
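The ping-pong behavior can be modelled in a few lines. The class below is a behavioral sketch only: the buffer depth of eight samples is an assumption made here to match an 8-point transform, and the class and method names are illustrative.

```python
class PingPongBuffer:
    """Serial-in ping register line feeding a parallel-loaded pong register line."""

    def __init__(self, depth=8):
        self.depth = depth   # assumed number of samples per line
        self.ping = []       # filled serially, one sample per clock
        self.pong = None     # loaded in parallel when ping is full

    def clock(self, sample):
        """Shift one sample into ping; return True when pong was just reloaded."""
        self.ping.append(sample)
        if len(self.ping) == self.depth:
            self.pong = list(self.ping)  # parallel transfer ping -> pong
            self.ping = []               # ping is free for the next group
            return True
        return False


buf = PingPongBuffer()
for cycle in range(16):
    if buf.clock(cycle):
        print(cycle, buf.pong)
```

While the pong line holds a stable group of samples for the stage operator, the ping line keeps accepting new samples, which is what lets each stage sustain one sample per clock cycle.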
The 1-D DCT was the first designed module. A Finite State Machine (FSM) was
designed to control the architecture datapath.

Fig. 3. 1-D 8x8 DCT Architecture

A transpose buffer [11] was designed to transpose the resulting matrix from the
first 1-D DCT, generating the input matrix for the second 1-D DCT transform.
The transpose buffer is composed of two 64-word RAMs and three multiplexers,
besides various control signals, as presented in Fig. 4. The RAMs operate in an
alternating way: while one of them is used for writing, the other one is used for
reading. Thus, the first 1-D DCT architecture writes the results line by line in one
memory (RAM1 or RAM2) and the second 1-D DCT architecture reads the input
values column by column from the other memory (RAM2 or RAM1).
The signals Wad and Rad define the memory addresses and the signals Control1
and Control2 define the read/write signals of the memories. The main signals of this
architecture are also controlled by a local FSM.
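A behavioral sketch of the transpose buffer is given below. It models only the addressing idea (row-order writes, column-order reads, two memories swapping roles) and works on whole blocks rather than cycle by cycle; the class, its methods and the `wad` variable are illustrative stand-ins, not the signal-level design of Fig. 4.

```python
class TransposeBuffer:
    """Two 64-word memories used alternately for writing and reading."""

    def __init__(self, size=8):
        self.size = size
        self.rams = [[0] * (size * size), [0] * (size * size)]
        self.write_sel = 0  # index of the RAM currently being written

    def write_block(self, samples):
        """Store one block in arrival order (row by row), then swap the RAM roles."""
        if len(samples) != self.size * self.size:
            raise ValueError("expected a full block")
        ram = self.rams[self.write_sel]
        for wad, sample in enumerate(samples):  # wad plays the role of the write address
            ram[wad] = sample
        self.write_sel ^= 1

    def read_block(self):
        """Read the most recently completed block column by column."""
        ram = self.rams[self.write_sel ^ 1]
        n = self.size
        # Read address = row * n + col, scanned with the column index outermost.
        return [ram[row * n + col] for col in range(n) for row in range(n)]


tb = TransposeBuffer()
tb.write_block(list(range(64)))
print(tb.read_block()[:8])  # [0, 8, 16, 24, 32, 40, 48, 56]
```

In hardware the write of block k+1 into one RAM overlaps the read of block k from the other, which is why two memories are needed to keep both 1-D modules busy.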

Fig. 4. Transpose Buffer Architecture

Each 1-D DCT architecture has its own FSM to control its pipeline. These local
FSMs control the data synchronization among these modules.
The first 1-D DCT architecture has an 8-bit input and a 13-bit output. The
transpose buffer has a 13-bit input and output. In the second 1-D DCT architecture a
13-bit input and an 18-bit output is used. Finally, the 2-D DCT architecture has an 8-
bit input and an 18-bit output.
The two 1-D DCT architectures each have a latency of 40 clock cycles. The
transpose buffer latency is 64 clock cycles. Then, the global 8x8 2-D DCT latency is
144 clock cycles.
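The global latency is the sum of the three module latencies. As a worked example, the second line converts it to time at the 161.66 MHz clock reported for the Stratix II synthesis in Section 5; the symbols are introduced here only for illustration.

$$
L_{2\text{-D}} = L_{1\text{-D}} + L_{\text{buffer}} + L_{1\text{-D}} = 40 + 64 + 40 = 144 \text{ clock cycles}
$$

$$
t_{2\text{-D}} = \frac{144}{161.66 \times 10^{6}\ \text{Hz}} \approx 0.89\ \mu\text{s}
$$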

4 Architecture Validation
The reference data for validation of the designed architecture was extracted directly
from the H.264/AVC encoder reference software, and the ModelSim tool was used to
run the simulations.
A testbench was designed in VHDL to generate the input stimuli and to store the
output results in text files. The input stimuli were the input data extracted from
the reference software. The first simulation considered just a behavioral model of the
designed architecture. The second simulation considered a post place-and-route model
of the designed architecture. In this step the ISE tool was used together with
ModelSim to generate the post place-and-route information. The target device
selected was a Xilinx VP30 Virtex-II Pro FPGA. After some corrections in the VHDL
descriptions, the comparison between the simulation results and the reference
software results indicated no differences between them.
The designed architecture was also synthesized for standard cells using the Leonardo
Spectrum tool; the target technology was TSMC 0.35µm. Afterwards, the ModelSim
tool was used again to run new simulations considering the files generated by Leonardo
Spectrum and to validate the standard-cell version of this architecture.
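The comparison between simulation output and reference data can be automated with a short script. The sketch below assumes a hypothetical dump format of one signed integer per line, since the paper does not describe the layout of its text files; `parse_dump` and `compare_dumps` are illustrative names.

```python
def parse_dump(text):
    """Parse a text dump with one signed integer per non-empty line (assumed format)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def compare_dumps(sim_text, ref_text):
    """Return (index, simulated, reference) for every mismatching sample."""
    sim = parse_dump(sim_text)
    ref = parse_dump(ref_text)
    if len(sim) != len(ref):
        raise ValueError(
            f"length mismatch: {len(sim)} simulated vs {len(ref)} reference samples")
    return [(i, s, r) for i, (s, r) in enumerate(zip(sim, ref)) if s != r]


reference = "64\n0\n-12\n7\n"
simulated = "64\n0\n-12\n7\n"
print(compare_dumps(simulated, reference))  # []
```

An empty list corresponds to the "no differences" outcome reported above; any entry pinpoints the sample index where the hardware model and the reference diverge.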

5 Synthesis Results
The architectures of the two 1-D DCTs and the Transpose Buffer were described in
VHDL and synthesized to Altera Stratix II EP2S15F484C3 FPGA, Xilinx VP30
Virtex II Pro FPGA and TSMC 0.35µm standard-cell technologies. These
architectures were grouped to form the 2-D DCT architecture which was also
synthesized for these target technologies.
The 2-D DCT architecture was designed to reach real time (24 fps) when
processing HDTV 1080 frames, considering the HP, Hi10P and H422P profiles.
Thus, color relations of 4:2:0 and 4:2:2 are allowed and 8 or 10 bits per sample are
supported. In this case, the target throughput is 100 million samples per second.
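The target of about 100 million samples per second follows from the frame size, the frame rate and the chroma format. The sketch below reproduces the arithmetic, assuming 1920x1080 frames and the usual sample counts per pixel (1.5 for 4:2:0, 2 for 4:2:2); the function name is illustrative.

```python
# Average samples per pixel (luma plus subsampled chroma) for each color relation.
SAMPLES_PER_PIXEL = {"4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}


def samples_per_second(width, height, fps, chroma):
    """Samples that must be transformed each second for the given video format."""
    return width * height * fps * SAMPLES_PER_PIXEL[chroma]


for chroma in ("4:2:0", "4:2:2"):
    rate = samples_per_second(1920, 1080, 24, chroma)
    print(chroma, f"{rate / 1e6:.1f} Msamples/s")
```

The 4:2:2 case gives roughly 99.5 million samples per second, which is the worst case among the three supported profiles and rounds to the stated target. The number of bits per sample changes the datapath width but not the sample rate.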
This section presents the synthesis results obtained considering a 2-D DCT input
bit width of 8 bits. The synthesis results of the two 1-D DCT modules, transpose
buffer module and the complete 2-D DCT targeting Altera and Xilinx FPGAs are
presented in Tables 3 and 4, respectively.
From Table 3 and Table 4 it is possible to notice the differences in hardware
resource usage and in maximum operation frequency between the two 1-D DCT
modules, since the second 1-D DCT module uses a higher bit width than the
first one. It is also possible to notice in both tables that the transpose
buffer uses few logic elements and reaches a high operation frequency, since it is
basically two Block RAMs and a small amount of control logic.
From Table 3 it is important to notice that the 8x8 2-D DCT uses 2,718 LUTs
of the Altera Stratix II FPGA and reaches a maximum operation frequency of
161.66 MHz. With these results this 2-D DCT is able to process 161.66 million
samples per second. This rate is enough to process HDTV 1080 frames in real time
(24 fps) when the 4:2:0 or 4:2:2 color relations are considered.

Table 3. Synthesis results to Altera Stratix II FPGA

                             Total Logic Elements
Blocks                       LUTs    Flip Flops   Mem. Bits   Period (ns)   Throughput (Msamples/s)
First 1-D DCT Transform      1,072   877          -           5.03          198.77
Transpose Buffer             40      16           1,664       2.00          500
Second 1-D DCT Transform     1,065   1,332        -           5.18          193.09
2-D DCT Integer Transform    2,718   2,225        1,664       6.18          161.66
Selected Device: Stratix II EP2S15F484C3

Table 4 presents the results for the Xilinx Virtex-II Pro FPGA. This synthesis
reported a use of 1,430 LUTs and a maximum operation frequency of 122.87 MHz,
allowing a processing rate of 122.87 million samples per second. This processing
rate is also enough to reach real time when processing HDTV 1080 frames.

Table 4. Synthesis results to Xilinx Virtex II - Pro FPGA

                             Total Logic Elements
Blocks                       LUTs    Flip Flops   Mem. Bits   Period (ns)   Throughput (Msamples/s)
First 1-D DCT Transform      562     884          -           6.49          153.86
Transpose Buffer             44      17           2           2.31          432.11
Second 1-D DCT Transform     776     1,344        -           7.09          141.02
2-D DCT Integer Transform    1,430   2,250        2           8.13          122.87
Selected Device: Virtex II - Pro 2vp30ff896-7

Table 5 shows the synthesis results targeting the TSMC 0.35µm standard-cell
technology for all designed blocks. Besides, this table emphasizes the synthesis
results of the 2-D DCT architecture with and without the synthesis of the Block
RAMs. From these results it is possible to notice that the number of gates used in
the architecture with the Block RAMs is almost double that of the architecture
without them. This difference occurs because the memories were mapped
directly to register banks.
Nevertheless, this architecture is able to process 124.1 million samples per
second, also reaching the throughput to process HDTV 1080 frames in real time.
The presented synthesis results indicate that the 2-D DCT architecture designed in
this work reaches a processing rate of 24 HDTV 1080 frames per second considering
all technology targets. This processing rate allows the use of this architecture in
H.264/AVC encoders for the HP, Hi10P and H422P profiles, which target high
resolution videos.

Table 5. Synthesis results to TSMC 0.35µm standard-cells technology

Blocks                             Total Logic Elements (Gates)   Period (ns)   Throughput (Msamples/s)
First 1-D DCT Transform            7,510                          6.33          158.1
Transpose Buffer                   15,196                         4.65          215.2
Second 1-D DCT Transform           11,230                         7.58          131.9
2-D DCT Transform (without RAM)    19,084                         7.58          131.9
2-D DCT Transform (with RAM)       33,936                         8.05          124.1
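To put the reported throughputs in perspective, the sketch below converts each complete 2-D DCT figure from Tables 3 to 5 into the frame rate it could sustain for 1920x1080 video in the 4:2:2 format, the most demanding of the targeted profiles. The throughput values are copied from the tables; the frame-size assumption and the names are illustrative.

```python
REQUIRED_FPS = 24
SAMPLES_PER_FRAME_422 = 1920 * 1080 * 2  # luma plus two half-width chroma planes

# Throughput of the complete 2-D DCT in Msamples/s, from Tables 3, 4 and 5.
throughput_msamples = {
    "Altera Stratix II": 161.66,
    "Xilinx Virtex-II Pro": 122.87,
    "TSMC 0.35um (with RAM)": 124.1,
}


def max_frame_rate(msamples_per_s, samples_per_frame=SAMPLES_PER_FRAME_422):
    """Frames per second sustained at one sample per clock cycle."""
    return msamples_per_s * 1e6 / samples_per_frame


for target, rate in throughput_msamples.items():
    fps = max_frame_rate(rate)
    print(f"{target}: {fps:.1f} frames/s, meets {REQUIRED_FPS} fps: {fps >= REQUIRED_FPS}")
```

Under these assumptions all three targets stay above 24 frames per second, with roughly 29 frames per second on the two slower targets and about 39 on the Stratix II.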

6 Related Works
There are many papers in the literature that present dedicated hardware designs for
the 8x8 2-D DCT, but papers targeting the complete 8x8 2-D DCT defined in the
H.264/AVC High profile were not found. There are some papers about the 4x4 2-D
DCT of the H.264/AVC standard, but not about the 8x8 2-D DCT.
Only three papers were found about the High profile transforms, and none of them
reports the complete hardware design of the 8x8 2-D DCT defined in the standard.
The first work [12] proposes a new encoding scheme to compute the classical 8x8
DCT coefficients using error-free algebraic integer quantization (AIQ). The algorithm
was described in Verilog and synthesized for a Xilinx VirtexE FPGA. That work
reported an operation frequency of 101.5 MHz and a consumption of 1,042 LUTs, but
it did not report throughput data.
The second work [13] proposes a hardware implementation of the H.264/AVC
simplified 8x8 2-D DCT and quantization. However, that work implements just the
1-D DCT architecture and not the 8x8 2-D DCT architecture.
The comparison with the first paper [12] shows that the architecture designed in
this paper presents a higher operation frequency with a small increase in hardware
resource consumption. A comparison in terms of throughput was not viable, since
this data is not presented in [12]. The comparison with the second paper is not
possible, since it reports only an 8x8 1-D DCT and quantization design while this
work presents an 8x8 2-D DCT.
Finally, the third work [14] proposes a fast algorithm for the 8x8 2-D forward and
inverse DCT and it also proposes an architecture for these transforms. But this
architecture was not implemented in hardware; therefore, it is not possible to make
comparisons with this work.
Other 8x8 2-D solutions presented in the literature were also compared with the
architecture presented in this paper. These other solutions are not compliant with the
H.264/AVC standard. Solutions [11], [15], [16], [17] and [18] present hardware
implementations of the 8x8 2-D DCT using some type of approximation to use only
integer arithmetic instead of the floating point arithmetic originally present in the 2-D
DCT. A comparison of our design with the others, in terms of throughput and
technology, is presented in Table 6. The differences between those implementations
will not be explained, as they used completely different technologies, physical
architectures and techniques to reduce area and power.
The throughputs in Table 6 show that our 8x8 2-D DCT implemented in Stratix II
surpasses all other implementations. Our standard-cell based 8x8 2-D DCT is able to
process 124 million samples per second and it presents the highest throughput
among the presented standard-cell designs.
Our FPGA based results could be better had we used macro function adders, which
are able to use the special fast carry chains that are present in the FPGAs.
Based on these comparisons, it is possible to conclude that the 8x8 2-D Forward
DCT architecture designed in this paper compares favorably with the other published
works, reaching the highest throughput among the designs listed in Table 6.

Table 6. Comparative results for 8x8 2-D DCT

Design                       Technology   Throughput (Msamples/s)
Our Standard-cell version    0.35µm       124
Fu [15]                      0.18µm       75
Agostini [11]                0.35µm       44
Katayama [16]                0.35µm       27
Hunter [17]                  0.35µm       25
Chang [18]                   0.6µm        23.6
Our Stratix II version       Stratix II   162
Agostini [11]                Stratix II   161
Our Virtex II version        Virtex II    123

7 Conclusions and Future Works

This work presented the design and validation of a high performance H.264/AVC
8x8 2-D DCT architecture. The implementation details and the synthesis results
targeting FPGAs and standard cells were also presented. This architecture was
designed to reach high throughputs and to be easily integrated with the other
H.264/AVC modules.
The modules which compose the 2-D DCT architecture were synchronized and a
constant processing rate of one sample per clock cycle is achieved. This constant
processing rate is independent of the data type and it is important to ease the
integration of this architecture with other modules.
The synthesis results showed a minimum period of 8.13 ns for the slower FPGA
target (Virtex-II Pro) and a minimum period of 8.05 ns for standard cells. These
results indicate that the global architecture is able to process at least 122.87 million
samples per second when mapped to FPGAs and 124.1 million samples per second
when mapped to standard cells, allowing its use in H.264/AVC encoders targeting
HDTV 1080 at 24 frames per second.
As future work, we plan to explore other design strategies for the 8x8 DCT of the
H.264/AVC standard and to compare the obtained results. The first design strategy
to be explored is to implement another 8x8 2-D DCT transform in a parallel fashion
with a processing rate of 8 samples per clock cycle. Another future work is the
integration of this module in the Forward Transform module of the H.264/AVC encoder.

References
1. Joint Video Team of ITU-T, and ISO/IEC JTC 1: Draft ITU-T Recommendation and Final
Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 or ISO/IEC
14496-10 AVC). JVT Document, JVT-G050r1 (2003)
2. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC
Video Coding Standard. IEEE Transactions on Circuits and Systems For Video
Technology 13, 560–576 (2003)
3. Sullivan, G.J., Topiwala, P.N., Luthra, A.: The H.264/AVC Advanced Video Coding
Standard: Overview and Introduction to the Fidelity Range Extensions. In: SPIE
Conference on Application of Digital Image Processing, Denver, CO, vol. XXVII (5558),
pp. 454–474 (2004)
4. Gordon, S., Marpe, D., Wiegand, T.: Simplified Use of 8x8 Transforms. JVT Document,
JVT-I022 (2004)
5. Gordon, S., Marpe, D., Wiegand, T.: Simplified Use of 8x8 Transforms - Updated
Proposal & Results. JVT Document, JVT-K028 (2004)
6. Marpe, D., Wiegand, T., Gordon, S.: H.264/MPEG4-AVC Fidelity Range Extensions: Tools,
Profiles, Performance, and Application Areas. In: International Conference on Image
Processing, ICIP 2005, Genova, Italy, vol. 1, pp. 593–596 (2005)
7. Richardson, I.E.G.: H.264 and MPEG-4 Video Compression - Video Coding for Next-
Generation Multimedia. John Wiley & Sons, Chichester, UK (2003)
8. Malvar, H.S., Hallapuro, A., Karczewicz, M., Kerofsky, L.: Low-Complexity Transform
and Quantization in H.264/AVC. IEEE Transactions on Circuits and Systems for Video
Technology 13, 598–603 (2003)
9. Bhaskaran, V., Konstantinides, K.: Image and Video Compression Standards: Algorithms
and Architectures, 2nd edn. Kluwer Academic Publishers, Norwell, MA (1997)
10. Agostini, L.V., Porto, R.E.C., Bampi, S., Rosa, L.Z.P., Güntzel, J.L., Silva, I.S.: High
Throughput Architecture for H.264/AVC Forward Transforms Block. In: Great Lake
Symposium on VLSI, GLSVLSI 2006, New York, NY, pp. 320–323 (2006)
11. Agostini, L.V., Silva, T.L., Silva, S.V., Silva, I.S., Bampi, S.: Soft and Hard IP Design of a
Multiplierless and Fully Pipelined 2-D DCT. In: International Conference on Very Large
Scale Integration, VLSI-SOC 2005, Perth, Western Australia, pp. 300–305 (2005)
12. Wahid, K., Dimitrov, V., Jullien, G.: New Encoding of 8x8 DCT to make H.264 Lossless.
In: Wahid, K., Dimitrov, V., Jullien, G. (eds.) Asia Pacific Conference on Circuits and
Systems, APCCAS 2006, Singapore, pp. 780–783 (2006)
13. Amer, I., Badawy, W., Jullien, G.: A High-Performance Hardware Implementation of the
H.264 Simplified 8X8 Transformation and Quantization. In: International Conference on
Acoustics, Speech, and Signal Processing, ICASSP 2005, Philadelphia, PA, vol. 2, pp.
1137–1140 (2005)
14. Fan, C.-P.: Fast 2-D Dimensional 8x8 Integer Transform Algorithm Design for
H.264/AVC Fidelity Range Extensions. IEICE Transactions on Informatics and
Systems E89-D, 2006–3011 (2006)
15. Fu, M., Jullien, G.A., Dimitrov, V.S., Ahmadi, M.: A Low-Power DCT IP Core Based on
2D Algebraic Integer Encoding. In: International Symposium on Circuits and Systems,
ISCAS 2004, Vancouver, CA, vol. 2, pp. 765–768 (2004)
16. Katayama, Y., Kitsuki, T., Ooi, Y.: A Block Processing Unit in a Single-Chip MPEG-2
Video Encoder LSI. In: Workshop on Signal Processing Systems, Shanghai, China, pp.
459–468 (1997)
17. Hunter, J., McCanny, J.: Discrete Cosine Transform Generator for VLSI Synthesis. In:
International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1998,
Seattle, WA, vol. 5, pp. 2997–3000 (1998)
18. Chang, T.-S., Kung, C.-S., Jen, C.-W.: A Simple Processor Core Design for DCT/IDCT.
IEEE Transactions on Circuits and Systems for Video Technology 10, 439–447 (2000)

No ratings yet
DCT Thesis
12 pages
Subramanian 2010
No ratings yet
Subramanian 2010
4 pages
High-Efficiency and Low-Power Architectures For 2-D DCT and IDCT Based On CORDIC Rotation
No ratings yet
High-Efficiency and Low-Power Architectures For 2-D DCT and IDCT Based On CORDIC Rotation
6 pages
Algorithm and Architecture Design of The H.265HEVC Intra Encoder
No ratings yet
Algorithm and Architecture Design of The H.265HEVC Intra Encoder
6 pages
H.264/ AVC: Compression Standard
No ratings yet
H.264/ AVC: Compression Standard
21 pages
Image Compression Using High Efficient Video Coding (HEVC) Technique
No ratings yet
Image Compression Using High Efficient Video Coding (HEVC) Technique
3 pages
A Hybrid Transformation Technique For Advanced Video Coding: M. Ezhilarasan, P. Thambidurai
No ratings yet
A Hybrid Transformation Technique For Advanced Video Coding: M. Ezhilarasan, P. Thambidurai
7 pages
Second Order Effects
100% (2)
Second Order Effects
40 pages
Kia ED
100% (1)
Kia ED
22 pages
Prova 100 Manual
No ratings yet
Prova 100 Manual
14 pages
Efficient Implementation of Low Power 2-D DCT Architecture
No ratings yet
Efficient Implementation of Low Power 2-D DCT Architecture
6 pages
Parallela Cluster by Michael Johan Kruger
No ratings yet
Parallela Cluster by Michael Johan Kruger
56 pages
Low Power DCT Architecture For Image/Video Coders: IPASJ International Journal of Electronics & Communication (IIJEC)
No ratings yet
Low Power DCT Architecture For Image/Video Coders: IPASJ International Journal of Electronics & Communication (IIJEC)
10 pages
BIOS and DOS Interrupts
83% (6)
BIOS and DOS Interrupts
42 pages
Bobina Tesla FORTE SSTCusing555timer
No ratings yet
Bobina Tesla FORTE SSTCusing555timer
7 pages
Anna University Trichy B.Tech Information Technology Syllabus
No ratings yet
Anna University Trichy B.Tech Information Technology Syllabus
84 pages
Appendix 2a - Decibels and Signal Strength
100% (3)
Appendix 2a - Decibels and Signal Strength
2 pages
Datasheet Acs 120 PDF
No ratings yet
Datasheet Acs 120 PDF
11 pages
1.3.2 D Type Flip Flops
No ratings yet
1.3.2 D Type Flip Flops
54 pages
Maintenance PDF
No ratings yet
Maintenance PDF
31 pages
Performance of A Computer
No ratings yet
Performance of A Computer
83 pages
BLOWN UP - Network
No ratings yet
BLOWN UP - Network
3 pages
DB Bluelog XM XC en 20240125
No ratings yet
DB Bluelog XM XC en 20240125
3 pages
E2001 Circuit Analysis: Academic Year 2020-2021
No ratings yet
E2001 Circuit Analysis: Academic Year 2020-2021
15 pages
Wistron Sjv50 TR
No ratings yet
Wistron Sjv50 TR
59 pages
Hamamatsu Opto-Semiconductor Modules
No ratings yet
Hamamatsu Opto-Semiconductor Modules
28 pages
Unit 1 Single-Phase Transformer Tutorial
No ratings yet
Unit 1 Single-Phase Transformer Tutorial
15 pages
Optoelectronic Devices
No ratings yet
Optoelectronic Devices
10 pages
TeSys K Contactors - LC1K1210M7
No ratings yet
TeSys K Contactors - LC1K1210M7
4 pages
Expt-8-Elements of Electronics Engineering
No ratings yet
Expt-8-Elements of Electronics Engineering
12 pages
EcE 52013 Equations
No ratings yet
EcE 52013 Equations
10 pages
Product Specifications Product Specifications: HBXX HBXX - 3817TB1 3817TB1 - A2M A2M
No ratings yet
Product Specifications Product Specifications: HBXX HBXX - 3817TB1 3817TB1 - A2M A2M
4 pages
Synopsys On Digital Thermometer Using ATMEGA 16
No ratings yet
Synopsys On Digital Thermometer Using ATMEGA 16
3 pages
MMSZ5257BQ 7 F PDF
No ratings yet
MMSZ5257BQ 7 F PDF
4 pages
EE210 Assignment4 Solution
No ratings yet
EE210 Assignment4 Solution
3 pages
Analog Circuit Design Engineer
No ratings yet
Analog Circuit Design Engineer
1 page
Oracle Certified Professional Java Programmer OCPJP 1Z0 809
From Everand
Oracle Certified Professional Java Programmer OCPJP 1Z0 809
Manish Soni
No ratings yet

A Pipelined 8x8 2-D Forward DCT Hardware Architecture For H.264/AVC High Profile Encoder

Uploaded by

A Pipelined 8x8 2-D Forward DCT Hardware Architecture For H.264/AVC High Profile Encoder

Uploaded by

Fig. 1. Block diagram of an H.264/AVC encoder

2 8x8 2-D Forward DCT Algorithm

This transform can be calculated through fast butterfly operations according to

Table 1. 2-D Forward 8x8 DCT Algorithm

Step 1 Step 2 Step 3
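
The three-step butterfly named in Table 1 can be sketched in software as follows. This is a minimal sketch, assuming the 8x8 forward integer transform as it is commonly given for the H.264/AVC High profiles; it is not taken from the paper's tables. The function names fdct8_1d and fdct8x8 are hypothetical, and the scaling normally folded into quantization is omitted. Only additions, subtractions, and shifts are used, and the 2-D transform is computed as two 1-D passes with a transposition between them.

```python
def fdct8_1d(x):
    """One 1-D pass of the 8-point forward integer transform (add/shift only).

    Assumption: butterfly as commonly published for H.264/AVC High profile;
    the names a*, b* are illustrative, not the paper's signal names.
    """
    # Step 1: sums and differences of mirrored input samples
    a0, a1, a2, a3 = x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4]
    a4, a5, a6, a7 = x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]

    # Step 2: even part (b0..b3) and odd part (b4..b7)
    b0, b1 = a0 + a3, a1 + a2
    b2, b3 = a0 - a3, a1 - a2
    b4 = a5 + a6 + ((a4 >> 1) + a4)
    b5 = a4 - a7 - ((a6 >> 1) + a6)
    b6 = a4 + a7 - ((a5 >> 1) + a5)
    b7 = a5 - a6 + ((a7 >> 1) + a7)

    # Step 3: output coefficients
    return [
        b0 + b1,
        b4 + (b7 >> 2),
        b2 + (b3 >> 1),
        b5 + (b6 >> 2),
        b0 - b1,
        b6 - (b5 >> 2),
        (b2 >> 1) - b3,
        (b4 >> 2) - b7,
    ]


def fdct8x8(block):
    """Separable 2-D transform: rows, transpose, rows again, transpose back."""
    rows = [fdct8_1d(r) for r in block]
    cols = [fdct8_1d(list(c)) for c in zip(*rows)]  # zip(*rows) transposes
    return [list(r) for r in zip(*cols)]


flat = [[1] * 8 for _ in range(8)]
coeffs = fdct8x8(flat)
print(coeffs[0][0])  # DC coefficient of a constant block
```

A constant block puts all of its energy into the DC coefficient, which makes a convenient sanity check for the butterfly wiring.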

Table 2. 2-D Forward 8x8 DCT Modified Algorithm

Step 1 Step 2 Step 3

Fig. 2. 2-D DCT Block Diagram

Fig. 3. 1-D 8x8 DCT Architecture

Fig. 4. Transpose Buffer Architecture

ModelSim to generate the post place-and-route information. The target device

Table 3. Synthesis results to Altera Stratix II FPGA

Total Logic Elements Period Throughput

Table 4. Synthesis results to Xilinx Virtex II - Pro FPGA

Total Logic Elements Period Throughput

Table 5 shows the synthesis results targeting TSMC 0.35µm standard-cells

Table 5. Synthesis results to TSMC 0.35µm standard-cells technology

Total Logic Period Throughput

Table 6. Comparative results for 8x8 2-D DCT

7 Conclusions and Future Works
