Implementation of Efficient Parallel Discrete Cosine Transform Using Stochastic Logic
Abstract—This paper presents a new scheme for the VLSI implementation of a parallel Discrete Cosine Transform (DCT) using stochastic logic. Stochastic computation is a number representation that can carry out complex computations at very low hardware cost. However, the output delay is proportional to the length of the serial bit sequence. We provide a new area-saving parallel DCT design that improves system throughput by using our proposed stochastic OR-adder and OR-AND-adder. Results show that the proposed parallel stochastic DCT can meet the requirements of image processing while maintaining a 5% performance difference compared to the traditional DCT implementation. Our synthesized chip design using the TSMC CMOS 130nm technology also shows that the proposed parallel stochastic DCT is at least 10 times more efficient in area and delay than the traditional DCT and the serial stochastic DCT.

Keywords—Stochastic logic, image processing, parallel DCT

I. INTRODUCTION

DCT is a widely used image compression technique [1], but it has high power consumption. The stochastic number system proposed in [2] can achieve better fault tolerance and an ultra-high clock rate, and has a much simpler circuit structure [3], than the traditional DCT implementation. Therefore, further study of this representation for low-complexity, low-power DCT implementation is valuable.

Existing work on stochastic implementations covers stochastic logic circuit synthesis (division [3], polynomial computation [4], and numerical integration [5]) and functional module implementations (stochastic FIR [6]). However, the data throughput of these designs is low because of the long data sequences. Therefore, the stochastic-based DCT implementation has been proposed, but there are three challenges:

1. Additions in the stochastic DCT are difficult to implement and become the bottleneck in terms of hardware cost and critical path latency, which is completely different from the traditional DCT design using the Two’s Complement System (TCS).

2. The throughput of the serial stochastic DCT is low because the original stochastic representations have a long output delay.

3. Conversion between the stochastic number system and the traditional number system has high hardware complexity.

In this paper we present a parallel stochastic adder to improve the overall system throughput. We then discuss a new scheme to design a fully parallel stochastic DCT using our proposed adders. Our simulation results show that it can achieve the high throughput needed for real-time image processing. The remainder of the paper is organized as follows. In Section II, we briefly introduce the stochastic representation of numbers and the basic stochastic computation blocks. In Section III, we describe the design of our OR-adder and OR-AND-adder as well as the parallel stochastic DCT. In Section IV, we show the performance measurements. The paper is concluded in Section V.

II. STOCHASTIC NUMBER REPRESENTATION

A. Stochastic computation

Stochastic logic uses the following number representation. Each real-valued number x (0 ≤ x ≤ 1) is represented by a sequence of random bits, and the probability of a “1” bit in the random bit stream has the value x. For example, one representation of the real value 0.5 is (0,1,0,1,1,0,0,1,1,0), which contains five “1” bits.

A sample data flow diagram of stochastic computation is shown in Fig. 1. There are two number representation domains. Conversion between the Two’s Complement System (TCS) domain and the probability domain is linear due to the isometric compression property [3]:

$$P_x = \frac{x/c + 1}{2}, \quad (-c \le x \le c,\ 0 \le P_x \le 1) \qquad (1)$$

where c is the range of the input number. For example, the input value shown in Fig. 1 is 2 and the range is 8, so the whole number requires 3 bits to represent. In the probability domain, there are two ways to represent a number: as a random bit stream and as a probability value. The conversion from the TCS domain to the probability domain (T2P) is the first step and follows Eq. (1); after this conversion the example value is 0.625, as shown in Fig. 1. Conversion from a probability number to a random bit stream (P2S) is the second step; the example number becomes the sequence (1,1,0,1,1,0,1,0). These two steps are referred to as the “forward conversion”, after which the stochastic bit streams are operated on by stochastic computational elements. The results are then converted back to a probability number (S2P) and to the TCS domain (P2T), which is referred to as the “reverse conversion”.
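The forward and reverse conversions around Eq. (1) can be illustrated with a minimal Python sketch. The function names (`t2p`, `p2s`, `s2p`, `p2t`) simply mirror the step labels used in the text and are hypothetical; in particular, generating the bit stream by shuffling an exact number of ones is an assumption made for reproducibility, since the text does not specify how the hardware sequence generator works.

```python
import random


def t2p(x, c):
    """TCS value x in [-c, c] -> probability in [0, 1], following Eq. (1)."""
    if not -c <= x <= c:
        raise ValueError("x must lie in [-c, c]")
    return (x / c + 1) / 2


def p2s(p, length, rng):
    """Probability -> bit stream of the given length.

    Assumption: the stream holds exactly round(p * length) ones in a random
    order, so that the encoded value is exact for this illustration.
    """
    ones = round(p * length)
    bits = [1] * ones + [0] * (length - ones)
    rng.shuffle(bits)
    return bits


def s2p(bits):
    """Bit stream -> probability (fraction of ones)."""
    return sum(bits) / len(bits)


def p2t(p, c):
    """Probability -> TCS value, the inverse of Eq. (1)."""
    return (2 * p - 1) * c


if __name__ == "__main__":
    p = t2p(2, 8)                          # example from the text: x = 2, c = 8
    stream = p2s(p, 8, random.Random(0))   # forward conversion
    print(p, stream, p2t(s2p(stream), 8))  # reverse conversion recovers x
```

With x = 2 and c = 8, `t2p` gives 0.625, which matches the five ones in the eight-bit example sequence (1,1,0,1,1,0,1,0) quoted in the text.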
Authorized licensed use limited to: China University of Petroleum. Downloaded on March 06,2024 at 23:51:06 UTC from IEEE Xplore. Restrictions apply.
TABLE I. THE COMPARISON OF DIFFERENT COMPUTATION ELEMENTS

| Computation element | Chip area (um^2) | Critical path | Clock cycles |
|---|---|---|---|
| Serial Stochastic Adder | 4.2 | mux | 2^(N-1) |
| Serial Stochastic Multiplier | 2.8 | 1 gate | 2^(N-1) |
| Parallel Stochastic Adder | 4.2*2^(N-1) | mux | 1 |
| Proposed Parallel Stochastic Adder | 2.8*2^(N-1) | 1 gate | 1 |
| Parallel Stochastic Multiplier | 2.8*2^(N-1) | 1 gate | 1 |
| TCS Adder | 5N | 2N gates | 1 |
| TCS Multiplier | 5N^2 | 2N units | 1 |

For the TCS adder we use a carry-ripple addition circuit, and the TCS multiplier is the parallel carry-ripple array multiplier in [11]; N is the bit-width. We use an ASIC platform, Design Compiler (DC), and the TSMC CMOS 130nm technology.

Fig. 3. The proposed stochastic adder: (a) OR-adder; (b) OR-AND-adder. In (a), the inputs p_x = 0.25 (1,1,0,0,0,0,0,0) and p_y = 0.5 (0,0,0,0,1,1,1,1) give the output (1,1,0,0,1,1,1,1), so P_z = P_x + P_y = 0.75. In (b), the inputs p_x = 0.375 (1,1,0,0,0,0,1,0) and p_y = 0.75 (1,0,0,1,1,1,1,1) give the OR output (1,1,0,1,1,1,1,1) with P_z = 0.875 and the AND output (1,0,0,0,0,0,1,0) with P_z' = 0.25.

... signal of a sequence generator. The quantization error is $e_q = p_x - Q(p_x)$, and the quantization noise is

$$\sigma_q^2 = E\{[p_x - Q(p_x)]^2\} = L \cdot \frac{1}{12L^3} = \frac{1}{12L^2} \qquad (7)$$

Therefore, the signal-to-quantization-noise ratio of stochastic logic is the same as that of TCS quantization:

$$\left(\frac{S}{N}\right)_q = 6N = 6(\log_2 L + 1)\ \text{(dB)} \qquad (8)$$

In some cases, if the input bit streams cannot satisfy the design rules, overflow occurs. Therefore, we design the OR-AND-adder shown in Fig. 3(b) to handle the overflow situation. The AND gate provides the value that is lost in the OR operation. In an OR-AND-adder, a general stochastic sequence can be used because an ordered sequence is no longer required.

B. Parallel stochastic DCT

We deduce the DCT algorithm as:

$$
\begin{bmatrix} X(0)\\ X(2)\\ X(4)\\ X(6)\\ X(1)\\ X(3)\\ X(5)\\ X(7) \end{bmatrix}
=
\begin{bmatrix}
C_4 & C_4 & C_4 & C_4 & 0 & 0 & 0 & 0\\
C_2 & C_6 & -C_6 & -C_2 & 0 & 0 & 0 & 0\\
C_4 & -C_4 & -C_4 & C_4 & 0 & 0 & 0 & 0\\
C_6 & -C_2 & C_2 & -C_6 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & C_1 & C_3 & C_5 & C_7\\
0 & 0 & 0 & 0 & C_3 & -C_7 & -C_1 & -C_5\\
0 & 0 & 0 & 0 & C_5 & -C_1 & C_7 & C_3\\
0 & 0 & 0 & 0 & C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
\begin{bmatrix} x(0)+x(7)\\ x(1)+x(6)\\ x(2)+x(5)\\ x(3)+x(4)\\ x(0)-x(7)\\ x(1)-x(6)\\ x(2)-x(5)\\ x(3)-x(4) \end{bmatrix}
\qquad (9)
$$

The multiplications in Eq. (9) are not simplified as they are in [8]. In addition, the multiplier is the bottleneck of the traditional DCT computation. In stochastic logic, however, a multiplier can be realized by a very simple gate, so implementing the parallel DCT of Eq. (9) in stochastic logic does not require a high hardware cost. In that case, addition in the stochastic number representation becomes the bottleneck. In the DCT implementation, we can employ our proposed OR-adder and OR-AND-adder to reduce the area of addition in the parallel structure. The algorithm can be divided into eight independent units. The structure of each unit of the DCT is shown in Fig. 4. We use an OR-adder as the Type-A stochastic adder, since its input sequences have been generated in ordered form, and an OR-AND-adder as the Type-B stochastic adder, since a randomly ordered sequence is required to maintain high performance in the stochastic multipliers [3]. We use the unordered coefficient sequence as one operand of the multiplier to guarantee its performance. The output of the stochastic multiplier is a random sequence. Part C is the final processing stage shown in Fig. 4, which adds the overflowed data to the stochastic computation results in the TCS domain to avoid calculation errors. The proposed structure C in Fig. 4 realizes the P2T conversion function and also compensates for the performance loss.

Fig. 4. The basic unit of parallel stochastic DCT (parts A, B and C, drawn for the output X(2) with coefficients C2, C6, C6, C2; labelled blocks: T2P converter, up-down counter, counter1, counter2, P2T converter).

IV. IMPLEMENTATION AND DISCUSSION

We have used MATLAB2011 as a simulation platform for the performance measurements. The results are shown in Fig. 5. The reference signal for the signal-to-noise ratio (SNR) is the floating-point TCS DCT result. The noise used for SDCT-SNR, PDCT-SNR and DCT-SNR is the absolute deviation of the serial stochastic DCT (SDCT), the parallel stochastic DCT (PDCT) and the fixed-point TCS DCT, respectively, from the floating-point DCT result. Performance analysis results show that the proposed parallel stochastic DCT stays within a 5% performance difference of the traditional TCS DCT. We also implemented the serial stochastic DCT, the parallel stochastic DCT and the TCS DCT on an ASIC platform with Design Compiler (DC) using the TSMC CMOS 130nm technology. The synthesis results are shown in Table II.
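Returning to the two adders of Fig. 3, their bit-level behaviour can be reproduced with a minimal Python sketch that uses the streams shown in the figure. The function names are hypothetical. Reading the AND output as the amount lost by the OR output relies on the identity (a OR b) + (a AND b) = a + b for single bits; how the hardware recombines the two streams is not spelled out here, so the final sum in the sketch is an assumption.

```python
def prob(bits):
    """Value encoded by a stochastic bit stream: the fraction of ones."""
    return sum(bits) / len(bits)


def or_adder(a, b):
    """OR-adder: bitwise OR of two streams.

    The result equals the true sum only when the ones of the two inputs never
    coincide, which is the case for the ordered streams of Fig. 3(a).
    """
    return [x | y for x, y in zip(a, b)]


def or_and_adder(a, b):
    """OR-AND-adder: returns (OR stream, AND stream).

    The AND stream marks the positions where both inputs are one, which is
    exactly the value the OR stream loses.
    """
    return [x | y for x, y in zip(a, b)], [x & y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Fig. 3(a): ordered, non-overlapping inputs 0.25 and 0.5.
    z = or_adder([1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1])
    print(z, prob(z))

    # Fig. 3(b): general inputs 0.375 and 0.75, whose sum exceeds 1.
    z_or, z_and = or_and_adder([1, 1, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 1, 1, 1, 1])
    print(prob(z_or), prob(z_and), prob(z_or) + prob(z_and))
```

For the Fig. 3(b) inputs the OR stream encodes 0.875 and the AND stream 0.25, which together account for the full sum 0.375 + 0.75 = 1.125 that a single stream cannot represent.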
TABLE II. THE COMPARISON OF THREE DIFFERENT SCHEMES OF DCT

Columns are grouped by bit-width (length of sequence): 8 (128), 10 (512) and 12 (2048).

| Scheme* | SDCT, 8 (128) | PDCT, 8 (128) | DCT, 8 (128) | SDCT, 10 (512) | PDCT, 10 (512) | DCT, 10 (512) | SDCT, 12 (2048) | PDCT, 12 (2048) | DCT, 12 (2048) |
|---|---|---|---|---|---|---|---|---|---|
| Chip area (um^2) | 2599 | 13824 | 19010 | 2599 | 55296 | 25004 | 2599 | 221184 | 31073.5 |
| Delay | 204.8 | 0.25# | 11.6 | 819.2 | 0.25 | 15.8 | 3276.8 | 0.25 | 19.55 |
| AT (10^3) | 532 | 3.4 | 220 | 2129 | 13.8 | 395 | 8516 | 55.2 | 607 |
| AT^2 (10^6) | 108 | 0.0008 | 2.5 | 1743 | 0.0035 | 6.2 | 27892 | 0.013 | 11.8 |

*SDCT: serial stochastic DCT in [9]; PDCT: parallel stochastic DCT; DCT: DCT in TCS. Delay is calculated from the clock cycle and the critical path. The correspondence between bit-width in TCS and sequence length in the stochastic number system is L = 2^(N-1) for a number with one sign bit, where L is the length of the sequence and N is the fixed-point bit-width.
# means the data come out of the synthesis with Design Compiler (DC) using the TSMC CMOS 130nm logic library.
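The AT and AT^2 rows of Table II can be recomputed from the area and delay rows. The short Python sketch below does this under the assumption that AT means area times delay and AT^2 means area times delay squared, scaled by 10^3 and 10^6 as the row labels indicate; the dictionary and function names are hypothetical. The printed table values appear to be truncated rather than rounded, so small differences from the recomputed figures are to be expected.

```python
# Area and delay values copied from Table II, keyed by (bit_width, scheme).
TABLE_II = {
    (8, "SDCT"): (2599.0, 204.8),
    (8, "PDCT"): (13824.0, 0.25),
    (8, "DCT"): (19010.0, 11.6),
    (10, "SDCT"): (2599.0, 819.2),
    (10, "PDCT"): (55296.0, 0.25),
    (10, "DCT"): (25004.0, 15.8),
    (12, "SDCT"): (2599.0, 3276.8),
    (12, "PDCT"): (221184.0, 0.25),
    (12, "DCT"): (31073.5, 19.55),
}


def sequence_length(bit_width):
    """Stochastic sequence length L = 2^(N-1) for an N-bit signed number."""
    return 2 ** (bit_width - 1)


def area_delay_metrics(area, delay):
    """Return (AT in units of 10^3, AT^2 in units of 10^6).

    Assumption: AT = area * delay and AT^2 = area * delay**2.
    """
    return area * delay / 1e3, area * delay ** 2 / 1e6


if __name__ == "__main__":
    for (bits, scheme), (area, delay) in TABLE_II.items():
        at, at2 = area_delay_metrics(area, delay)
        print(f"N={bits:2d} L={sequence_length(bits):4d} {scheme:4s} "
              f"AT={at:9.3f}e3 AT2={at2:12.6f}e6")
```

Dividing the AT value of the TCS DCT by that of the PDCT at each bit-width gives a direct check on the efficiency comparison made in the abstract.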
[Fig. 5: plot of Performance (dB) against Bit-Width (bit), with curves for the Serial Stochastic DCT, the Parallel Stochastic DCT and the Traditional DCT.]

V. CONCLUSIONS

In this paper, we proposed the design of two basic stochastic units: an OR-adder and an OR-AND-adder. Such parallel architectures can replace conventional multipliers and significantly save chip area. We also implemented the parallel stochastic DCT design, which can meet the performance requirements of real-time image processing (up to 450 MHz) while using less area.