
Fast Algorithm

This paper presents a new fast algorithm for computing the 2-D discrete cosine transform (DCT) that significantly reduces the number of multiplications required compared to conventional methods. The proposed algorithm utilizes only N 1-D DCTs for an N x N DCT, where N = 2^m, resulting in a hardware implementation that requires only a quarter of the multipliers needed for traditional approaches. The algorithm is shown to be efficient for VLSI implementation and maintains a systematic computation structure.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 1991

Fast Algorithm and Implementation of 2-D Discrete Cosine Transform
Nam Ik Cho, Student Member, IEEE, and Sang Uk Lee, Member, IEEE

Abstract - In this paper, a new algorithm for the fast computation of a 2-D discrete cosine transform (DCT) is presented. It is shown that the $N \times N$ DCT, where $N = 2^m$, can be computed using only $N$ 1-D DCT's and additions, instead of using $2N$ 1-D DCT's, as in the conventional row-column approach. Hence the total number of multiplications for the proposed algorithm is only half of that required for the row-column approach, and is also less than that of most other fast algorithms, while the number of additions is almost comparable to that of others. It is also shown that only $N/2$ 1-D DCT modules are required for the hardware parallel implementation of the proposed algorithm. Thus the number of actual multipliers being used is only a quarter of that required for the conventional approach.

Manuscript received April 25, 1990. This paper was recommended by Associate Editor T. R. Hsing.
The authors are with the Department of Control and Instrumentation Engineering, Seoul National University, Seoul 151-742, Korea.
IEEE Log Number 9041758.
0098-4094/91/0300-0297$01.00 © 1991 IEEE

I. INTRODUCTION

Since the DCT approaches the statistically optimal Karhunen-Loeve transform (KLT) for highly correlated signals, it is widely used in digital signal processing, especially for speech and image data compression [1], [2]. Thus many algorithms and VLSI architectures for the fast computation of the DCT have been proposed [3]-[10]. For the fast computation of the 2-D DCT, the conventional approach is the row-column method. This method requires $2N$ 1-D DCT's for the computation of the $N \times N$ DCT. However, for hardware parallel implementation of the conventional approach, a complicated matrix transposition architecture as well as $2N$ 1-D DCT modules is required. Thus, for more efficient computation or parallel implementation of the 2-D DCT, algorithms that work directly on the 2-D data set have been introduced [11]-[16]. The most efficient 2-D DCT algorithm that has appeared in the literature is the direct polynomial approach proposed by Duhamel [16], in which the number of multiplications is reduced to 50% of that of the conventional approach. On the other hand, the algorithms in [13] and [14] require 75%, and the indirect approach using the polynomial transform FFT and rotation proposed by Vetterli [15] requires between 50% and 75% of the conventional approach. More recently, a fast algorithm for the $4 \times 4$ DCT was proposed [17]. This algorithm is restricted to the $4 \times 4$ transform, because the derivation is too complicated to be generalized to arbitrary $2^m \times 2^m$ cases, where $m$ is a positive integer. Hence it is useful for the computation of larger transforms only in combination with the recursive 2-D DCT technique [14]. In the case of the $4 \times 4$ DCT, it requires the same number of multiplications as in [15] and [16]. However, for the larger transforms, the 2-D recursive technique employing the $4 \times 4$ DCT [17] requires more multiplications than those of [15] and [16].

In this paper, a new fast 2-D DCT algorithm, which may be viewed as a modification and generalization of the $4 \times 4$ DCT [17], is proposed. It will be shown that the proposed algorithm requires only $N$ 1-D DCT's for the computation of the $N \times N$ DCT, where $N = 2^m$. Hence the number of multiplications required for the proposed algorithm is only half of that required for the conventional approach, which is, in fact, the same number of multiplications as reported in [16]. However, as compared with Duhamel's algorithm [16], the proposed algorithm has the advantage that the computation structure is highly regular and systematic, and only real arithmetic is required. Also, we shall show that $N/2$ 1-D DCT modules are sufficient for the hardware parallel implementation of the algorithm. Hence the number of actual multipliers being used in a hardware implementation is a quarter of that required for the conventional approach. Thus the proposed 2-D DCT algorithm is very suitable for VLSI implementation.

The rest of the paper is organized as follows. In Section II, we will introduce a new fast 2-D DCT algorithm along with an example for the $8 \times 8$ DCT. Also, the examples for the $4 \times 4$ DCT, $8 \times 8$ inverse DCT (IDCT), and $4 \times 4$ IDCT are provided. The comparison of the number of multiplications and additions with other fast algorithms [14]-[16] is also given in this section. In Section III, we shall discuss the parallel implementation of the algorithm. Finally, in Section IV, we give conclusions.

II. THE FAST ALGORITHM AND ITS PARALLEL IMPLEMENTATION

In this section, we shall describe a new fast algorithm for the 2-D DCT that requires only half the number of multiplications compared to the conventional row-column method. Also, we shall provide examples for $8 \times 8$ and $4 \times 4$ DCT's. The examples for the IDCT are also given.

A. A Fast 2-D DCT Algorithm

For a given 2-D data sequence $\{x_{ij}: i, j = 0, 1, \ldots, N-1\}$, the 2-D DCT sequence $\{Y_{mn}: m, n = 0, 1, \ldots, N-1\}$ is given by

$$Y_{mn} = \frac{4}{N^2}\, u(m)\, u(n) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{ij} \cos\frac{(2i+1)m}{2N}\pi \, \cos\frac{(2j+1)n}{2N}\pi \tag{1}$$


where

$$u(m) = \begin{cases} 1/\sqrt{2}, & m = 0 \\ 1, & \text{otherwise.} \end{cases}$$

We will neglect the scale factor $4u(m)u(n)/N^2$ for convenience. Then, let us define a denormalized form of $Y_{mn}$ as

$$y_{mn} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{ij} \cos\frac{(2i+1)m}{2N}\pi \, \cos\frac{(2j+1)n}{2N}\pi \tag{2.a}$$

i.e.,

$$Y_{mn} = \frac{4}{N^2}\, u(m)\, u(n)\, y_{mn}. \tag{2.b}$$

The main idea behind the $4 \times 4$ algorithm proposed in [17] is that the $4 \times 4$ DCT can be decomposed into four separate four-point 1-D DCT's by using the following relation:

$$\cos\frac{(2i+1)m}{2N}\pi \, \cos\frac{(2j+1)n}{2N}\pi = \frac{1}{2}\left\{ \cos\frac{(2i+1)m + (2j+1)n}{2N}\pi + \cos\frac{(2i+1)m - (2j+1)n}{2N}\pi \right\}. \tag{3}$$

In this paper, we shall make use of the above relation for the computation of the $N \times N$ DCT, where $N \ge 8$. Using the relation in (3), the $N \times N$ DCT can be separated into two transforms as given by

$$y_{mn} = \frac{1}{2}\left\{ \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{ij} \cos\frac{(2i+1)m + (2j+1)n}{2N}\pi + \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{ij} \cos\frac{(2i+1)m - (2j+1)n}{2N}\pi \right\}, \quad \text{for } m, n = 0, 1, 2, \ldots, N-1. \tag{4}$$

For convenience, by defining new transforms $A_{mn}$ and $B_{mn}$ as

$$A_{mn} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{ij} \cos\frac{(2i+1)m + (2j+1)n}{2N}\pi \tag{5.a}$$

$$B_{mn} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{ij} \cos\frac{(2i+1)m - (2j+1)n}{2N}\pi \tag{5.b}$$

$y_{mn}$ can be rewritten as

$$y_{mn} = \frac{1}{2}\left(A_{mn} + B_{mn}\right). \tag{6}$$

Now, we shall show that $A_{mn}$ and $B_{mn}$ can be expressed in terms of $N$ 1-D DCT's by some data ordering and manipulations, so that the $N \times N$ DCT can be obtained from $N$ separate 1-D DCT's. It is noted that the condition for the kernels of the transforms in (5) to be equivalent to that of 1-D DCT's is that $\{(2i+1)m \pm (2j+1)n\}$ should be expressed as $(2i+1)$ multiplied by some integer. To satisfy such a condition, we can see that $(2j+1)$ should be either a multiple of $(2i+1)$ modulo $2N$ or $2N$ minus a multiple of $(2i+1)$ modulo $2N$, i.e.,

$$(2j+1) = p(2i+1) \bmod 2N \tag{7.a}$$

or

$$(2j+1) = 2N - p(2i+1) \bmod 2N \tag{7.b}$$

where $p$ is an odd integer ranging from 1 to $N-1$, because a value of $p$ outside this range yields the same value of $j$ as one of those produced by a $p$ in the range. The relations in (7) are equivalent to

$$j = pi + (p-1)/2 \bmod N \tag{8.a}$$

or

$$j = N - 1 - pi - (p-1)/2 \bmod N, \quad \text{for } p = 1, 3, \ldots, N-1. \tag{8.b}$$

It can be easily shown that when $i$ ranges from 0 to $N-1$, the $N$ sequences for $j$ obtained by (8) are mutually different. Thus the 2-D input data set can be grouped into $N$ different data sets, each of whose indexes satisfies the relations in (8). Then, we can see that the kernel of the transforms for each of these data sets is equivalent to that of a 1-D DCT. To distinguish each of the sequences of $j$ obtained by (8.a) or (8.b) for $p = 1, 3, 5, \ldots, N-1$, let us denote them as

$$j(p;a) = pi + (p-1)/2 \bmod N \tag{9.a}$$

or

$$j(p;b) = N - 1 - pi - (p-1)/2 \bmod N, \quad \text{for } p = 1, 3, \ldots, N-1, \text{ and } i = 0, 1, 2, \ldots, N-1. \tag{9.b}$$

That is, for given $p$, $\{j(p;a): i = 0, 1, \ldots, N-1\}$ is the sequence of $j$ obtained by (8.a) and $\{j(p;b): i = 0, 1, \ldots, N-1\}$ is the sequence of $j$ obtained by (8.b). Hence, by grouping the 2-D input sequence $\{x_{ij}: i, j = 0, 1, 2, \ldots, N-1\}$ into $N$ 1-D sequences $\{x_{i\,j(p;a)}: i = 0, 1, 2, \ldots, N-1\}$ and $\{x_{i\,j(p;b)}: i = 0, 1, 2, \ldots, N-1\}$ for $p = 1, 3, 5, \ldots, N-1$, the transforms in (5) can be expressed as a sum of 1-D DCT's. We will denote these 1-D data sequences by $R_p^a$ and $R_p^b$, respectively. Then, they can be expressed as

$$R_p^a = \{x_{i\,j(p;a)}: i = 0, 1, 2, \ldots, N-1,\ j(p;a) = pi + (p-1)/2 \bmod N\} \tag{10.a}$$

$$R_p^b = \{x_{i\,j(p;b)}: i = 0, 1, 2, \ldots, N-1,\ j(p;b) = N - 1 - pi - (p-1)/2 \bmod N\}, \quad \text{for } p = 1, 3, 5, \ldots, N-1. \tag{10.b}$$

However, for the proof we are pursuing, it is necessary to know the exact result of $pi + (p-1)/2$ divided by $N$, while only the remainder of the division can be perceived from (10). In other words, we need to know the quotient of the division as well as the remainder. Hence, by introducing a new integer sequence $q_{pi}$, which is the quotient of $pi + (p-1)/2$ divided by $N$, we can rewrite (10) (without
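"mod") as

$$R_p^a = \{x_{i\,j(p;a)}: i = 0, 1, 2, \ldots, N-1,\ j(p;a) = pi + (p-1)/2 - Nq_{pi}\} \tag{11.a}$$

$$R_p^b = \{x_{i\,j(p;b)}: i = 0, 1, 2, \ldots, N-1,\ j(p;b) = N - 1 - pi - (p-1)/2 + Nq_{pi}\}, \quad \text{for } p = 1, 3, 5, \ldots, N-1. \tag{11.b}$$

As an example, for $N = 8$, since $p$ has the value of 1, 3, 5, and 7, the $8 \times 8$ 2-D data set can be grouped into

$$R_1^a = \{x_{00}, x_{11}, x_{22}, x_{33}, x_{44}, x_{55}, x_{66}, x_{77}\} \tag{12.a}$$
$$R_1^b = \{x_{07}, x_{16}, x_{25}, x_{34}, x_{43}, x_{52}, x_{61}, x_{70}\} \tag{12.b}$$
$$R_3^a = \{x_{01}, x_{14}, x_{27}, x_{32}, x_{45}, x_{50}, x_{63}, x_{76}\} \tag{12.c}$$
$$R_3^b = \{x_{06}, x_{13}, x_{20}, x_{35}, x_{42}, x_{57}, x_{64}, x_{71}\} \tag{12.d}$$
$$R_5^a = \{x_{02}, x_{17}, x_{24}, x_{31}, x_{46}, x_{53}, x_{60}, x_{75}\} \tag{12.e}$$
$$R_5^b = \{x_{05}, x_{10}, x_{23}, x_{36}, x_{41}, x_{54}, x_{67}, x_{72}\} \tag{12.f}$$
$$R_7^a = \{x_{03}, x_{12}, x_{21}, x_{30}, x_{47}, x_{56}, x_{65}, x_{74}\} \tag{12.g}$$
$$R_7^b = \{x_{04}, x_{15}, x_{26}, x_{37}, x_{40}, x_{51}, x_{62}, x_{73}\} \tag{12.h}$$

where the quotient sequences $q_{pi}$'s for $p$ = 1, 3, 5, and 7 are

$$q_{1i} = \{0, 0, 0, 0, 0, 0, 0, 0\} \tag{13.a}$$
$$q_{3i} = \{0, 0, 0, 1, 1, 2, 2, 2\} \tag{13.b}$$
$$q_{5i} = \{0, 0, 1, 2, 2, 3, 4, 4\} \tag{13.c}$$
$$q_{7i} = \{0, 1, 2, 3, 3, 4, 5, 6\}. \tag{13.d}$$

With this grouping, the transforms in (5) can be written as

$$A_{mn} = \sum_{\substack{p=1 \\ (p:\ \mathrm{odd})}}^{N-1} \{T_p^a(m,n) + T_p^b(m,n)\}, \qquad B_{mn} = \sum_{\substack{p=1 \\ (p:\ \mathrm{odd})}}^{N-1} \{S_p^a(m,n) + S_p^b(m,n)\} \tag{14}$$

where

$$T_p^a(m,n) = \sum_{x_{ij} \in R_p^a} x_{ij} \cos\frac{(2i+1)m + (2j+1)n}{2N}\pi \tag{15.a}$$

$$T_p^b(m,n) = \sum_{x_{ij} \in R_p^b} x_{ij} \cos\frac{(2i+1)m + (2j+1)n}{2N}\pi \tag{15.b}$$

$$S_p^a(m,n) = \sum_{x_{ij} \in R_p^a} x_{ij} \cos\frac{(2i+1)m - (2j+1)n}{2N}\pi \tag{15.c}$$

$$S_p^b(m,n) = \sum_{x_{ij} \in R_p^b} x_{ij} \cos\frac{(2i+1)m - (2j+1)n}{2N}\pi. \tag{15.d}$$

Then, from (6), $y_{mn}$ can be rewritten as

$$y_{mn} = \sum_{\substack{p=1 \\ (p:\ \mathrm{odd})}}^{N-1} \frac{1}{2}\{T_p^a(m,n) + T_p^b(m,n) + S_p^a(m,n) + S_p^b(m,n)\}. \tag{16}$$

Thus, in order to show that $y_{mn}$ is the summation of 1-D DCT's, it remains to show that $T_p^a(m,n)$, $T_p^b(m,n)$, $S_p^a(m,n)$, and $S_p^b(m,n)$ can be expressed in terms of 1-D DCT's. In order to do so, by substituting the relation in (11.a) into (15.a), we have

$$T_p^a(m,n) = \sum_{i=0}^{N-1} x_{i\,j(p;a)} \cos\frac{(2i+1)m + \{p(2i+1) - 2Nq_{pi}\}n}{2N}\pi$$

but it can be separated into two cases where $n$ is even or odd, i.e.,

$$T_p^a(m,n) = \sum_{i=0}^{N-1} x_{i\,j(p;a)} \cos\frac{(2i+1)(m+np)}{2N}\pi, \quad \text{when } n \text{ is even} \tag{17.a}$$

$$T_p^a(m,n) = \sum_{i=0}^{N-1} (-1)^{q_{pi}} x_{i\,j(p;a)} \cos\frac{(2i+1)(m+np)}{2N}\pi, \quad \text{when } n \text{ is odd.} \tag{17.b}$$

Also, substitution of the relation in (11.b) into (15.b) leads to

$$T_p^b(m,n) = \sum_{i=0}^{N-1} x_{i\,j(p;b)} \cos\frac{(2i+1)(m-np) + 2N(1+q_{pi})n}{2N}\pi$$

but this is also expressed separately depending on $n$, i.e.,

$$T_p^b(m,n) = \sum_{i=0}^{N-1} x_{i\,j(p;b)} \cos\frac{(2i+1)(m-np)}{2N}\pi, \quad \text{when } n \text{ is even} \tag{18.a}$$

$$T_p^b(m,n) = -\sum_{i=0}^{N-1} (-1)^{q_{pi}} x_{i\,j(p;b)} \cos\frac{(2i+1)(m-np)}{2N}\pi, \quad \text{when } n \text{ is odd.} \tag{18.b}$$

In the same way, by substituting (11.a) into (15.c), we have

$$S_p^a(m,n) = \sum_{i=0}^{N-1} x_{i\,j(p;a)} \cos\frac{(2i+1)(m-np)}{2N}\pi, \quad \text{when } n \text{ is even} \tag{19.a}$$

$$S_p^a(m,n) = \sum_{i=0}^{N-1} (-1)^{q_{pi}} x_{i\,j(p;a)} \cos\frac{(2i+1)(m-np)}{2N}\pi, \quad \text{when } n \text{ is odd.} \tag{19.b}$$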
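The grouping in (9) and the quotient sequence used in (11) are easy to tabulate. The following Python sketch is an illustration only, not part of the original paper; the function name `group_indices` is hypothetical. It lists $j(p;a)$, $j(p;b)$, and $q_{pi}$ for every odd $p$ and checks that each row $i$ of the input is covered exactly once.

```python
def group_indices(N):
    """Return {p: (ja, jb, q)} for odd p in 1..N-1.

    ja[i] = j(p;a) from (9.a), jb[i] = j(p;b) from (9.b), and q[i] is the
    quotient of p*i + (p-1)/2 divided by N, as used in (11).
    The function name is hypothetical, chosen for this illustration.
    """
    groups = {}
    for p in range(1, N, 2):
        ja, jb, q = [], [], []
        for i in range(N):
            quotient, remainder = divmod(p * i + (p - 1) // 2, N)
            ja.append(remainder)           # j(p;a)
            jb.append(N - 1 - remainder)   # j(p;b)
            q.append(quotient)             # q_pi
        groups[p] = (ja, jb, q)
    return groups


def covers_every_column(N):
    """True if, for each row i, the N data sets hit every column j once."""
    groups = group_indices(N)
    for i in range(N):
        columns = sorted(seq[i] for ja, jb, _ in groups.values() for seq in (ja, jb))
        if columns != list(range(N)):
            return False
    return True


groups = group_indices(8)
for p, (ja, jb, q) in groups.items():
    print(p, ja, jb, q)
print(covers_every_column(8))
```

For $N = 8$ and $p = 3$, the first list gives the column indexes of $R_3^a$ and the third list gives $q_{3i}$ as in (13.b).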

Also, by substituting (11.b) into (15.d), we have

$$S_p^b(m,n) = \sum_{i=0}^{N-1} x_{i\,j(p;b)} \cos\frac{(2i+1)(m+pn)}{2N}\pi, \quad \text{when } n \text{ is even} \tag{20.a}$$

$$S_p^b(m,n) = -\sum_{i=0}^{N-1} (-1)^{q_{pi}} x_{i\,j(p;b)} \cos\frac{(2i+1)(m+np)}{2N}\pi, \quad \text{when } n \text{ is odd.} \tag{20.b}$$

Then, by substituting (17)-(20) into (16), we can express $y_{mn}$ as

$$y_{mn} = \frac{1}{2} \sum_{\substack{p=1 \\ (p:\ \mathrm{odd})}}^{N-1} \left[ \sum_{i=0}^{N-1} (x_{i\,j(p;a)} + x_{i\,j(p;b)}) \cos\frac{(2i+1)(m+np)}{2N}\pi + \sum_{i=0}^{N-1} (x_{i\,j(p;a)} + x_{i\,j(p;b)}) \cos\frac{(2i+1)(m-np)}{2N}\pi \right], \quad \text{when } n \text{ is even} \tag{21.a}$$

$$y_{mn} = \frac{1}{2} \sum_{\substack{p=1 \\ (p:\ \mathrm{odd})}}^{N-1} \left[ \sum_{i=0}^{N-1} (-1)^{q_{pi}} (x_{i\,j(p;a)} - x_{i\,j(p;b)}) \cos\frac{(2i+1)(m+np)}{2N}\pi + \sum_{i=0}^{N-1} (-1)^{q_{pi}} (x_{i\,j(p;a)} - x_{i\,j(p;b)}) \cos\frac{(2i+1)(m-np)}{2N}\pi \right], \quad \text{when } n \text{ is odd.} \tag{21.b}$$

But, it can be seen that

$$\sum_{i=0}^{N-1} (x_{i\,j(p;a)} + x_{i\,j(p;b)}) \cos\frac{(2i+1)(m+np)}{2N}\pi \tag{22.a}$$

and

$$\sum_{i=0}^{N-1} (x_{i\,j(p;a)} + x_{i\,j(p;b)}) \cos\frac{(2i+1)(m-np)}{2N}\pi \tag{22.b}$$

correspond to one of the 1-D DCT's of the data sequence $\{x_{i\,j(p;a)} + x_{i\,j(p;b)}\}$ depending on $m$ and $n$. That is, by defining

$$f_{pl} = \sum_{i=0}^{N-1} (x_{i\,j(p;a)} + x_{i\,j(p;b)}) \cos\frac{(2i+1)l}{2N}\pi \tag{23}$$

we can see that (22.a) and (22.b) correspond to one of $f_{pl}$ or $-f_{pl}$ for some $l = 0, 1, 2, \ldots, N-1$. In the case of (21.b), by defining

$$g_{pl} = \sum_{i=0}^{N-1} (-1)^{q_{pi}} (x_{i\,j(p;a)} - x_{i\,j(p;b)}) \cos\frac{(2i+1)l}{2N}\pi \tag{24}$$

it can be seen that

$$\sum_{i=0}^{N-1} (-1)^{q_{pi}} (x_{i\,j(p;a)} - x_{i\,j(p;b)}) \cos\frac{(2i+1)(m+np)}{2N}\pi \tag{25.a}$$

and

$$\sum_{i=0}^{N-1} (-1)^{q_{pi}} (x_{i\,j(p;a)} - x_{i\,j(p;b)}) \cos\frac{(2i+1)(m-np)}{2N}\pi \tag{25.b}$$

correspond to one of $\pm g_{pl}$ for some $l = 0, 1, 2, \ldots, N-1$.

Hence, for the computation of the $N \times N$ DCT sequence $\{y_{mn}: m, n = 0, 1, 2, \ldots, N-1\}$, we need only $\{f_{pl}: l = 0, 1, 2, \ldots, N-1\}$ and $\{g_{pl}: l = 0, 1, 2, \ldots, N-1\}$ for $p = 1, 3, 5, \ldots, N-1$. This implies that the computation of the $N \times N$ DCT requires only the computation of $N$ 1-D DCT's.

Now, after $f_{pl}$ and $g_{pl}$ are obtained, let us discuss the additions and other operations required for the computation of the $N \times N$ DCT. From (21) and the definitions in (23) and (24), it is seen that the $y_{mn}$'s are expressed in terms of the summation of $f_{pl}$'s and $g_{pl}$'s. In order to see the relationships that exist between $y_{mn}$ and the $f_{pl}$'s and $g_{pl}$'s, for some arbitrarily chosen $m$ and $n$, let us give some examples for $N = 8$ as follows:

$$y_{30} = \frac{1}{2}(f_{13} + f_{13} + f_{33} + f_{33} + f_{53} + f_{53} + f_{73} + f_{73}) \tag{26.a}$$

$$y_{52} = \frac{1}{2}(f_{17} + f_{13} - f_{35} + f_{31} - f_{51} + f_{55} - f_{73} - f_{77}). \tag{26.b}$$

In the above example, the addition operation in terms of $f_{pl}$'s and $g_{pl}$'s for computing the $y_{mn}$'s looks complicated. However, we shall show that the addition operation can be implemented by butterfly stages as in ordinary fast algorithms for the discrete Fourier transform (DFT) and DCT. Now, in the case $n$ is even, it can be seen that

$$\cos\frac{(2i+1)(m - n(N-p))}{2N}\pi = \pm\cos\frac{(2i+1)(m+np)}{2N}\pi \tag{27}$$

which implies that if
$$\sum_{i=0}^{N-1} (x_{i\,j(p;a)} + x_{i\,j(p;b)}) \cos\frac{(2i+1)(m+np)}{2N}\pi = f_{pl} \tag{28.a}$$

for some $l = 0, 1, 2, \ldots, N-1$, then

$$\sum_{i=0}^{N-1} (x_{i\,j(N-p;a)} + x_{i\,j(N-p;b)}) \cos\frac{(2i+1)(m - n(N-p))}{2N}\pi = \pm f_{(N-p)l}. \tag{28.b}$$

In the same way, for $n$ odd, it can be shown that if

$$\sum_{i=0}^{N-1} (-1)^{q_{pi}} (x_{i\,j(p;a)} - x_{i\,j(p;b)}) \cos\frac{(2i+1)(m+np)}{2N}\pi = g_{pl} \tag{29.a}$$

then

$$\sum_{i=0}^{N-1} (-1)^{q_{(N-p)i}} (x_{i\,j(N-p;a)} - x_{i\,j(N-p;b)}) \cos\frac{(2i+1)(m - n(N-p))}{2N}\pi = \pm g_{(N-p)(N-l)}. \tag{29.b}$$

These relationships reveal that $f_{pl}$ always appears with $\pm f_{(N-p)l}$ in (26), allowing us to form a butterfly stage. In the case of $g_{pl}$, since it appears with $\pm g_{(N-p)(N-l)}$, we can also form a butterfly stage. For the example of $N = 8$, (26) can be rewritten as follows:

$$y_{30} = \frac{1}{2}\{(f_{13} + f_{73}) + (f_{13} + f_{73}) + (f_{33} + f_{53}) + (f_{33} + f_{53})\} \tag{30.a}$$

$$y_{52} = \frac{1}{2}\{(f_{17} - f_{77}) + (f_{13} - f_{73}) - (f_{35} - f_{55}) + (f_{31} - f_{51})\} \tag{30.b}$$

$$y_{34} = \frac{1}{2}\{(f_{17} + f_{77}) + (f_{11} + f_{71}) - (f_{37} + f_{57}) - (f_{31} + f_{51})\} \tag{30.c}$$

$$y_{26} = \frac{1}{2}\{(0 - 0) + (f_{14} - f_{74}) - (f_{34} - f_{54}) - (f_{30} - f_{50})\} \tag{30.d}$$

$$y_{41} = \frac{1}{2}\{(g_{15} + g_{73}) + (g_{13} - g_{75}) + (g_{37} + g_{51}) + (g_{31} - g_{57})\} \tag{30.e}$$

$$y_{03} = \frac{1}{2}\{(g_{13} - g_{75}) + (g_{13} - g_{75}) - (g_{37} + g_{51}) - (g_{37} + g_{51})\} \tag{30.f}$$

$$y_{35} = \frac{1}{2}\{(0 + g_{70}) + (g_{12} + g_{76}) - (g_{32} + g_{56}) - (g_{34} - g_{54})\} \tag{30.g}$$

$$y_{57} = \frac{1}{2}\{-(g_{14} + g_{74}) + (g_{12} - g_{76}) + (g_{36} + g_{52}) - (g_{30} + 0)\}. \tag{30.h}$$

Based on the formations shown above, as an example, the signal flow graph for an $8 \times 8$ DCT algorithm is shown in Fig. 1. The signal flow graph is separated into three parts for convenience, i.e., Fig. 1(a) is the signal flow graph from the $x_{ij}$'s to the $f_{pl}$'s and $g_{pl}$'s, Fig. 1(b) is from the $f_{pl}$'s to the $y_{mn}$'s, where $n$ is even, and Fig. 1(c) is from the $g_{pl}$'s to the $y_{mn}$'s, where $n$ is odd.

Fig. 1. The signal flow graph for 8 × 8 DCT. (a) Signal flow graph from $x_{ij}$ to $f_{pl}$ and $g_{pl}$. (b) Signal flow graph from $f_{pl}$ to $y_{mn}$ where $n$ is even. (c) Signal flow graph from $g_{pl}$ to $y_{mn}$ where $n$ is odd. Broken lines represent transfer factors $-1$ and full lines represent unity transfer factor. Circles represent adders, and the factor 1/2 marks multiplication by 1/2, which is equivalent to a shift operation.
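The index bookkeeping behind (26) and (30) can be reproduced mechanically. The Python sketch below is an illustration rather than the authors' procedure; `reduce_kernel` and `terms` are hypothetical names. It maps each kernel index $m \pm np$ to a signed $f_{pl}$ or $g_{pl}$ by using the evenness of the cosine, its period $4N$ in the index, and the relation $\cos\frac{(2i+1)(2N-L)}{2N}\pi = -\cos\frac{(2i+1)L}{2N}\pi$.

```python
def reduce_kernel(L, N):
    """Return (sign, l) such that cos((2i+1)*L*pi/(2N)) equals
    sign * cos((2i+1)*l*pi/(2N)) for every integer i, with 0 <= l < N.
    sign is 0 when the kernel is zero for every i (L congruent to N mod 2N).
    """
    L = abs(L) % (4 * N)      # cosine is even, and periodic in L with period 4N
    if L > 2 * N:
        L = 4 * N - L         # reflect into 0..2N without changing the value
    if L == N:
        return 0, 0           # cos((2i+1)*pi/2) = 0
    if L > N:
        return -1, 2 * N - L  # reflection about N flips the sign
    return 1, L


def terms(m, n, N):
    """List the signed f/g terms whose sum, halved, gives y_mn (cf. (26))."""
    name = "f" if n % 2 == 0 else "g"   # f for even n, g for odd n
    out = []
    for p in range(1, N, 2):
        for L in (m + n * p, m - n * p):
            sign, l = reduce_kernel(L, N)
            if sign == 0:
                out.append("0")
            else:
                out.append(("+" if sign > 0 else "-") + f"{name}{p}{l}")
    return out


print(terms(5, 2, 8))
# ['+f17', '+f13', '-f35', '+f31', '-f51', '+f55', '-f73', '-f77']
```

The printed list for $m = 5$, $n = 2$ agrees term by term with (26.b).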
From Fig. 1(a), it is seen that the $8 \times 8$ DCT requires only 8 1-D DCT's. From Fig. 1(b) and (c), it is also seen that the addition operations after the 1-D DCT stages can be implemented in butterfly form. The example for a $4 \times 4$ DCT is also shown in Fig. 2.

In Figs. 1 and 2, since the multiplications by one-half are equivalent to shift operations, the multiplications are required only for the computation of 1-D DCT's. Consequently, the number of multiplications required for the $N \times N$ DCT is equivalent to that for $N$ 1-D DCT's.
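As a numerical sanity check of this section, the following Python sketch computes the denormalized $y_{mn}$ of (2.a) from the $N$ sequences $f_p$ and $g_p$ of (23) and (24) and compares the result with the direct double sum. It is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the 1-D DCT is evaluated naively rather than with the fast algorithms of [6] or [7], and the final additions are written as plain sums instead of butterfly stages.

```python
import math
import random


def dct_1d(seq):
    """Unnormalized 1-D DCT: out[l] = sum_i seq[i] * cos((2i+1) l pi / 2N)."""
    N = len(seq)
    return [sum(v * math.cos((2 * i + 1) * l * math.pi / (2 * N))
                for i, v in enumerate(seq)) for l in range(N)]


def reduce_kernel(L, N):
    """(sign, l) with cos((2i+1)L pi/2N) = sign * cos((2i+1)l pi/2N), 0 <= l < N."""
    L = abs(L) % (4 * N)
    if L > 2 * N:
        L = 4 * N - L
    if L == N:
        return 0, 0
    if L > N:
        return -1, 2 * N - L
    return 1, L


def dct_2d_grouped(x):
    """y_mn of (2.a) obtained from N 1-D DCT's (N/2 values of p, f and g each)."""
    N = len(x)
    f, g = {}, {}
    for p in range(1, N, 2):
        sums, diffs = [], []
        for i in range(N):
            q, ja = divmod(p * i + (p - 1) // 2, N)  # q_pi and j(p;a)
            a, b = x[i][ja], x[i][N - 1 - ja]        # elements of R_p^a and R_p^b
            sums.append(a + b)                       # input of (23)
            diffs.append((-1) ** q * (a - b))        # input of (24)
        f[p] = dct_1d(sums)
        g[p] = dct_1d(diffs)
    y = [[0.0] * N for _ in range(N)]
    for m in range(N):
        for n in range(N):
            table = f if n % 2 == 0 else g           # (21.a) or (21.b)
            acc = 0.0
            for p in range(1, N, 2):
                for L in (m + n * p, m - n * p):
                    sign, l = reduce_kernel(L, N)
                    acc += sign * table[p][l]
            y[m][n] = acc / 2
    return y


def dct_2d_direct(x):
    """Reference: the double sum of (2.a) evaluated term by term."""
    N = len(x)
    c = [[math.cos((2 * i + 1) * k * math.pi / (2 * N)) for k in range(N)]
         for i in range(N)]
    return [[sum(x[i][j] * c[i][m] * c[j][n] for i in range(N) for j in range(N))
             for n in range(N)] for m in range(N)]


def max_error(N, seed=0):
    """Largest absolute difference between the two methods on random input."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(N)]
    fast, ref = dct_2d_grouped(x), dct_2d_direct(x)
    return max(abs(fast[m][n] - ref[m][n]) for m in range(N) for n in range(N))


for N in (4, 8, 16):
    print(N, max_error(N))
```

In this sketch `dct_1d` is called exactly $N$ times per transform, which is the count claimed above; the differences printed by the loop should be at the level of floating-point rounding.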
Fig. 2. The signal flow graph for 4 × 4 DCT.

B. Signal Flow Graph for IDCT

In the case of orthogonal transforms, if the scale factor is not taken into account, then the signal flow graph for the inverse transform is just the inverse of that for the forward transform. Similarly, in the proposed algorithm, the signal flow graph for the denormalized IDCT can be obtained by simply inverting the forward DCT. However, we need some modifications if we take into account the scale factor as shown in (1). As can be seen from the proposed signal flow graph for the forward transform in Figs. 1 or 2, there are some nodes that do not have their pairs. But this problem can be easily solved by multiplying the node variables by 2 when the direction of the flow is inverted. For example, in Fig. 1(b), it is seen that two nodes in the line of $f_{50}$ and $f_{70}$ do not have their pairs, and thus the node variables should be multiplied by 2 to keep the inverse flow correct. Also, due to the scale factor required for the 1-D IDCT, the zeroth input to the 1-D IDCT should be multiplied by one-half, which is equivalent to a shift operation. However, in the case of the 1-D IDCT's for the $g_{pl}$'s, the multiplications by one-half for the $g_{p0}$'s and the multiplications by 2 for the node variables cancel each other. The signal flow graphs for the $8 \times 8$ and $4 \times 4$ IDCT's are shown in Figs. 3 and 4, respectively. If one wants to maintain the scale factor for the output $x_{ij}$'s correctly, it is necessary to multiply every node variable by one-half or to divide the input $y_{mn}$'s by $N^2/2$. However, in the case of fixed-point computation, the former approach is better. In summary, when the output of the forward transform is used as input to the signal flow graphs for the IDCT's shown in Figs. 3 and 4, they generate the same data sequence as the input to the forward transform.

C. Comparison with Other Fast 2-D DCT Algorithms

In this section, we will compare the number of multiplications and additions with those of other fast algorithms [14]-[16]. Let the number of multiplications and additions required for the proposed algorithm be $M$ and $A$, respectively. Previously, we have shown that only $N$ 1-D DCT's are required for the computation of the $N \times N$ DCT, and thus the number of multiplications is only half of that required for the conventional row-column approach. In implementing the 1-D DCT's, we may use any existing 1-D DCT algorithm. When the highly efficient 1-D DCT algorithms proposed by Lee [6] or Hou [7] are used in the 1-D DCT computation, which require $(N/2)\log_2 N$ multiplications for an $N$-point 1-D DCT, the required number of multiplications for the $N \times N$ DCT is given by

$$M = (N^2/2)\log_2 N. \tag{31}$$

On the other hand, the required number of additions is the summation of those required for $N$ 1-D DCT's and those for the other additions as shown in Fig. 1. In other words, as can be seen from Figs. 1 or 2, we need additions for $(1 + \log_2 N)$ butterfly stages and for the 1-D DCT stage. However, at the last stage, it is observed that the additions for the $y_{0j}$'s, the $y_{i0}$'s, and $y_{(N/2)(N/2)}$ are not required. Also, we can see that the $g_{p0}$'s and $f_{p0}$'s, except for $f_{10}$ and $f_{30}$, do not require butterfly pairs. Thus the number of additions, except for those required for the 1-D DCT stages, is shown to be $N^2(1 + \log_2 N) - 2N - (N-2)$. Since the number of additions required for a 1-D DCT is $(3N/2)\log_2 N - N + 1$ [6], [7], the total number of additions required for the $N \times N$ DCT is given by

$$A = \frac{5}{2}N^2\log_2 N - 2N + 2. \tag{32}$$

Thus, with the computations according to (31) and (32), we can compare the number of multiplications and additions with those required for other fast 2-D DCT algorithms, such as [14]-[16]. The results are summarized in Table I. It is seen that the proposed algorithm requires the same number of multiplications as in [16], which is the least of all and only one half of that required for the conventional approach, while the number of additions is almost comparable to that of other algorithms.

III. PARALLEL IMPLEMENTATION OF THE PROPOSED ALGORITHM

For VLSI or hardware parallel implementation of an algorithm, reducing the number of multipliers is very important, because they occupy a large area of the chip. Other important considerations are regularity, modularity in the computation structure, and the complexity of the data access scheme. In this context, we first describe an implementation scheme which reduces the number of multipliers being used for parallel implementation, and then discuss problems such as modularity, regularity, and the data access scheme of the architecture.

It was shown that the number of multiplications required for the proposed algorithm is equivalent to that required for $N$ 1-D DCT's. Also, it seems that $N$ 1-D DCT modules are required to compute the $N \times N$ DCT in parallel. However, we
Fig. 3. The signal flow graph for 8 × 8 IDCT. (a) Signal flow graph from $y_{mn}$ to $f_{pl}$, where $n$ is even. (b) Signal flow graph from $y_{mn}$ to $g_{pl}$, where $n$ is odd. (c) Signal flow graph from $f_{pl}$ and $g_{pl}$ to $x_{ij}$.

TABLE I
COMPARISON OF THE NUMBER OF MULTIPLICATIONS AND ADDITIONS

Number of multiplications

Size       Conventional   [14]    [15]    [16]    Proposed
4 x 4      32             24      16      16      16
8 x 8      192            144     104     96      96
16 x 16    1024           768     568     512     512
32 x 32    5120           3840    2840    2560    2560

Number of additions

Size       Conventional   [14]    [15]    [16]    Proposed
4 x 4      72             72      70      68      74
8 x 8      464            464     462     484     466
16 x 16    2592           2592    2558    2531    2530
32 x 32    13376          13376   12950   12578   12738
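The "Conventional" and "Proposed" columns of Table I follow from the counts quoted in the text: $(N/2)\log_2 N$ multiplications and $(3N/2)\log_2 N - N + 1$ additions per $N$-point 1-D DCT [6], [7], plus $N^2(1 + \log_2 N) - 2N - (N - 2)$ further additions for the proposed algorithm. The short Python sketch below, with hypothetical function names, simply evaluates these expressions; the columns for [14]-[16] are not modeled.

```python
def dct1d_ops(N):
    """(multiplications, additions) of an N-point 1-D DCT as quoted from [6], [7]."""
    k = N.bit_length() - 1                  # log2(N) for N a power of two
    return (N // 2) * k, (3 * N // 2) * k - N + 1


def conventional_ops(N):
    """Row-column approach: 2N 1-D DCT's."""
    mul, add = dct1d_ops(N)
    return 2 * N * mul, 2 * N * add


def proposed_ops(N):
    """Proposed algorithm: N 1-D DCT's plus the butterfly additions."""
    mul, add = dct1d_ops(N)
    k = N.bit_length() - 1
    extra_adds = N * N * (1 + k) - 2 * N - (N - 2)
    return N * mul, N * add + extra_adds


for N in (4, 8, 16, 32):
    print(N, conventional_ops(N), proposed_ops(N))
```

For $N = 8$ this gives (192, 464) for the conventional approach and (96, 466) for the proposed algorithm, matching the corresponding rows of Table I.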

Fig. 4. The signal flow graph for the 4 × 4 IDCT.

shall show that $N/2$ 1-D DCT modules are sufficient by the use of multiplexers and demultiplexers. In Fig. 1(a), it is seen that the results for $f_{pl}$ and $g_{pl}$ remain the same even if the order of addition and 1-D DCT operation is reversed. Hence the signal flow graph in Fig. 1(a) is equivalent to Fig. 5, in which the order of addition and 1-D DCT operation is reversed for four of the data sets $R_p^a$ and $R_p^b$. Now, by using the multiplexers and demultiplexers as shown in Fig. 6, we can reduce the number of 1-D DCT modules to $N/2$. That is, in the upper part of Fig. 6, while the additions for one pair of data sets are in progress, the 1-D DCT's for another pair can be started. Then, the results of the additions are sent to 1-D
Fig. 5. Alternate signal flow graph of Fig. 1(a).

Fig. 6. Implementation of Fig. 1(a) with multiplexers and demultiplexers.

Fig. 7. Parallel implementation of the 4 × 4 DCT algorithm.

DCT processors. This is possible because the 1-D DCT architectures in [6] and [7] allow us to perform multiple processing in a parallel and pipelined environment. More specifically, since the data move successively through the pipelined structure, we can start the computation of the next input data immediately after the current input data. Thus it is noted that the 1-D DCT module need not run at twice the speed, yielding almost the same computation time as an implementation with $N$ 1-D DCT modules. Similarly, in the lower part of Fig. 6, while the additions for one pair of data sets are being performed, the 1-D DCT's for another pair can also be started. Then, the results of the additions are sent to the 1-D DCT modules through the multiplexer. The example for the parallel implementation of the $4 \times 4$ DCT is also shown in Fig. 7. From Figs. 6 and 7, it is seen that the number of 1-D DCT modules required for the $N \times N$ DCT is $N/2$, which is a quarter of that required for the conventional approach.

If we consider modularity and regularity, the proposed implementation scheme has an advantage over other fast algorithms such as [15] and [16], in which the polynomial transform and complex arithmetic are required. Hence, the proposed algorithm is believed to be more suitable for VLSI implementation than other 2-D FDCT algorithms in terms of the number of multipliers, modularity, and regularity. However, there are other problems that should be addressed in the VLSI implementation. For example, like most other fast algorithms, the proposed algorithm requires a more complicated data access scheme and computation structure than the simple matrix-vector (row-column) approach [18]. But the matrix-vector approach results in the largest chip area. There are always trade-offs between the chip area and computation time, and hence there exist many variations in the implementations. In conclusion, it is very difficult to determine an appropriate criterion for the VLSI implementation of the algorithms. However, this problem is beyond the scope of this paper.

IV. CONCLUSIONS

In this paper, a fast algorithm for the 2-D DCT is proposed. It is shown that the $N \times N$ DCT is obtained from $N$ 1-D DCT's with some additions and shift operations. Thus
the total number of multiplications required for the $N \times N$ DCT is $N$ times that for the 1-D DCT, which is only half of that required for the conventional row-column approach. Hence the number of multiplications is the same as that of the previously reported algorithm [16], which is known to be the best in terms of the number of multiplications, while the number of additions is comparable to others. However, the proposed algorithm has the advantages that it has a regular and systematic structure and requires only real arithmetic, while the algorithm in [16] requires complex arithmetic. Also, in this paper, for the purpose of reducing the hardware complexity of the parallel implementation, an alternative scheme is described with a slight increase in time complexity. The proposed scheme requires $N/2$ 1-D DCT modules, while the direct implementation requires $N$ 1-D DCT modules. Since there are always trade-offs between the chip area and computation time, it is very difficult to compare the performance of the implementations of the fast algorithms. However, considering only the hardware complexity, the proposed algorithm is advantageous in that it requires a very small number of multiplications and has a regular and systematic structure compared to other fast algorithms.

Finally, another important aspect in a VLSI implementation is the precision of the algorithm, i.e., the amount of error due to the fixed-point implementation. This problem is currently under investigation.

REFERENCES

[1] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Commun., vol. COM-23, pp. 90-93, Jan. 1974.
[2] N. Ahmed and K. R. Rao, Orthogonal Transforms for Digital Signal Processing. New York: Springer-Verlag, 1975.
[3] W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for discrete cosine transform," IEEE Trans. Commun., vol. COM-25, pp. 1004-1009, Nov. 1977.
[4] M. J. Narashimha and A. M. Peterson, "On the computation of the discrete cosine transform," IEEE Trans. Commun., vol. COM-26, pp. 934-936, June 1978.
[5] M. D. Wagh and H. Ganesh, "A new algorithm for the discrete cosine transform of arbitrary number of points," IEEE Trans. Comput., vol. C-29, pp. 269-277, Apr. 1980.
[6] B. G. Lee, "A new algorithm to compute the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1243-1245, Dec. 1984.
[7] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 1455-1461, Oct. 1987.
[8] N. I. Cho and S. U. Lee, "DCT algorithms for VLSI parallel implementation," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, pp. 121-127, Jan. 1990.
[9] M. Vetterli and H. Nussbaumer, "Simple FFT and DCT algorithms with reduced number of operations," Signal Process., vol. 6, pp. 267-278, Aug. 1984.
[10] P. Duhamel and H. H'Mida, "New 2^n DCT algorithms suitable for VLSI implementation," in Proc. ICASSP'87, pp. 1805-1808, 1987.
[11] J. Makhoul, "A fast cosine transform in one and two dimensions," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 27-34, Feb. 1980.
[12] F. A. Kamangar and K. R. Rao, "Fast algorithms for the 2-D discrete cosine transform," IEEE Trans. Comput., vol. C-31, pp. 899-906, Sept. 1982.
[13] M. A. Haque, "A two-dimensional fast cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 1532-1539, Dec. 1985.
[14] C. Ma, "A fast recursive two dimensional cosine transform," Intelligent Robots and Computer Vision: Seventh in a Series, David P. Casasent, Ed., in Proc. SPIE 1002, pp. 541-548, 1988.
[15] M. Vetterli, "Fast 2-D discrete cosine transform," in Proc. ICASSP'85, Mar. 1985.
[16] P. Duhamel and C. Guillemot, "Polynomial transform computation of 2-D DCT," in Proc. ICASSP'90, pp. 1515-1518, Apr. 1990.
[17] N. I. Cho and S. U. Lee, "A fast 4 × 4 DCT algorithm for the recursive 2-D DCT," IEEE Trans. Acoust., Speech, Signal Processing, submitted for publication.
[18] M.-T. Sun, T.-C. Chen, and A. M. Gottlieb, "VLSI implementation of a 16 × 16 discrete cosine transform," IEEE Trans. Circuits Syst., vol. 36, pp. 610-617, Apr. 1989.

Nam Ik Cho (S'86) received the B.S. and M.S. degrees from Seoul National University, Seoul, Korea, in 1986 and 1988, respectively, in control and instrumentation engineering. He is currently working toward the Ph.D. degree at Seoul National University.
His research interest is in digital signal processing, including adaptive filtering and VLSI implementation.

Sang Uk Lee (S'75-M'79) received the B.S. degree from Seoul National University in 1973, the M.S. degree from Iowa State University in 1976, and the Ph.D. degree from the University of Southern California, Los Angeles, in 1980, all in electrical engineering.
In 1980, he was with General Electric, Lynchburg, VA, and in 1981 he joined the M/A-COM Research Center, Rockville, MD. He is now with the Department of Control and Instrumentation at Seoul National University, where he is an Associate Professor. His current research interests are in the areas of image and speech signal processing, including VLSI and neural computing.
Dr. Lee is a member of Phi Kappa Phi.
