Implementation of Efficient Parallel Discrete Cosine Transform Using Stochastic Logic

Yan Li (1), Jianhao Hu (1), Jie Chen (2)
(1) University of Electronic Science and Technology of China, China
(2) University of Alberta, Edmonton, AB, Canada

Abstract—This paper presents a new scheme for the VLSI implementation of a parallel Discrete Cosine Transform (DCT) using stochastic logic. Stochastic computation relies on a number representation that can carry out complex computations at very low hardware cost. However, the output delay is proportional to the length of the serial sequence. We provide a new area-saving parallel DCT design that improves the system throughput by using our proposed stochastic OR-adder and OR-AND-adder. Results show that the proposed parallel stochastic DCT can meet the requirements of image processing while staying within a ±5% performance difference of the traditional DCT implementation. Our synthesized chip design in TSMC CMOS 130 nm technology also shows that the proposed parallel stochastic DCT is at least 10 times more efficient in area and delay than the traditional DCT and the serial stochastic DCT.

Keywords—Stochastic logic, image processing, parallel DCT

I. INTRODUCTION

The DCT is a widely used image compression technique [1], but it has high power consumption. The stochastic number system proposed in [2] can achieve better fault tolerance, a much higher clock rate, and a much simpler circuit structure [3] than the traditional DCT implementation. Further study of this representation for low-complexity, low-power DCT implementation is therefore valuable.

Existing work on stochastic implementations covers stochastic logic circuit synthesis (division [3], polynomial computation [4], and numerical integration [5]) and functional module implementations (stochastic FIR [6]). However, the data throughput of these designs is low because of the long data sequences. A stochastic DCT implementation has therefore been proposed, but it faces three challenges:

1. Additions in the stochastic DCT are difficult to implement and become the bottleneck in terms of hardware cost and critical path latency, which is completely different from the traditional DCT design using the Two’s Complement System (TCS).
2. The throughput of the serial stochastic DCT is low because the original stochastic representations have a long output delay.
3. Conversion between the stochastic number system and the traditional number system has high hardware complexity.

In this paper we present a parallel stochastic adder to improve the overall system throughput. We then discuss a new scheme to design a fully parallel stochastic DCT using our proposed adders. Our simulation results show that it can achieve the high throughput needed for real-time image processing. The rest of the paper is organized as follows. In Section II, we briefly introduce the stochastic number representation and the basic stochastic computation blocks. In Section III, we describe the design of our OR-adder and OR-AND-adder as well as the parallel stochastic DCT. In Section IV, we show the performance measurements. The paper is concluded in Section V.

II. STOCHASTIC NUMBER REPRESENTATION

A. Stochastic computation

Stochastic logic uses the following number representation:
• Each real-valued number x (0 ≤ x ≤ 1) is represented by a sequence of random bits.
• The probability of a bit being “1” in the random bit stream equals x. For example, one representation of the real value 0.5 is (0,1,0,1,1,0,0,1,1,0), which contains five “1”s.

A sample data flow diagram of stochastic computation is shown in Fig. 1. There are two number representation domains. Conversion between the Two’s Complement System (TCS) domain and the probability domain is linear due to the isometric compression property [3]:

$p_x = \frac{x/c + 1}{2}, \quad (-c \le x \le c,\ 0 \le p_x \le 1)$    (1)

where c is the range of the input number. For example, the input value in Fig. 1 is 2 and the range is 8, so the number requires 3 bits. In the probability domain there are two ways to represent a number: as a random bit stream and as a probability value. The conversion from the TCS domain to the probability domain (T2P) is the first step and follows Eq. (1). After this conversion the value is 0.625, as shown in Fig. 1. Conversion from a probability number to a random bit stream (P2S) is the second step. The example number becomes the sequence (1,1,0,1,1,0,1,0). These two steps are referred to as the “forward conversion”, after which the stochastic bit streams are operated on by stochastic computational elements. The results are then converted back to a probability number (S2P) and to the TCS domain (P2T), which is referred to as the “reverse conversion”.

This work was supported by the National Natural Science Foundation of China under Grant 61371104. This work was also supported by an NSERC Discovery Grant of Canada.

978-1-4799-5341-7/16/$31.00 ©2016 IEEE 706


Authorized licensed use limited to: China University of Petroleum. Downloaded on March 06,2024 at 23:51:06 UTC from IEEE Xplore. Restrictions apply.
[Fig. 1: source input in [-c, c] (x = 2, c = 8, 3 bit) → convert to probability domain by Eq. (1) (p_x = 0.625) → convert to stochastic bit stream (pseq_x = 1,1,0,1,1,0,1,0) → stochastic computation (pseq_y = 0,0,0,1,1,0,1,0) → back to probability domain (p_y = 0.375) → back to original domain (output y = -2).]

Fig. 1. The flow diagram of stochastic computation.
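The conversion chain of Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's circuit: the function names `t2p`, `p2s`, `s2p`, and `p2t` are chosen here to mirror the step names in the text, and `p2s` simply places the ones first, whereas the paper's comparator-based generator may order the bits differently.

```python
from fractions import Fraction


def t2p(x, c):
    """T2P, Eq. (1): map x in [-c, c] to a probability in [0, 1]."""
    return (Fraction(x) / c + 1) / 2


def p2s(p, length):
    """P2S: emit `length` bits whose fraction of ones equals p.

    Assumption: the ones are placed first; a hardware generator may
    produce a different bit order with the same count of ones.
    """
    ones = round(p * length)
    return [1] * ones + [0] * (length - ones)


def s2p(bits):
    """S2P: count the ones (a counter in hardware) and normalise."""
    return Fraction(sum(bits), len(bits))


def p2t(p, c):
    """P2T: inverse of Eq. (1), x = c * (2p - 1)."""
    return c * (2 * p - 1)


# Forward conversion of the Fig. 1 input (x = 2, c = 8, 8-bit stream).
p_x = t2p(2, 8)
stream_x = p2s(p_x, 8)

# Reverse conversion of the Fig. 1 output stream.
stream_y = [0, 0, 0, 1, 1, 0, 1, 0]
p_y = s2p(stream_y)
y = p2t(p_y, 8)

print(float(p_x), stream_x, float(p_y), float(y))
```

Exact fractions are used so that the round trip through the probability domain introduces no floating-point error; only the stream length limits the resolution.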

A comparator (not shown here) is used to generate the random bit streams [3], and counters are used to count the number of “1”s in the output bit sequence. For instance, if the output sequence y is (0,0,0,1,1,0,1,0), then the S2P result is 0.375 and the final output value in the TCS domain is -2, as shown in Fig. 1.

B. Stochastic logic

The following are the basic stochastic computational units:

1) Stochastic inverter

Let $x' = -x$, $x \in [-c, c]$. Then

$p_{x'} = \frac{-x/c + 1}{2} = 1 - \frac{x/c + 1}{2} = 1 - p_x$    (2)

Negation in the TCS domain is therefore realized by a stochastic inverter, as expressed by Eq. (2).

2) Stochastic multiplier

Let $x \in [-c_1, c_1]$ and $y \in [-c_2, c_2]$. Then

$p_z = \frac{\frac{xy}{c_1 c_2} + 1}{2} = \frac{1 - \frac{x}{c_1}}{2} \cdot \frac{1 - \frac{y}{c_2}}{2} + \frac{1 + \frac{x}{c_1}}{2} \cdot \frac{1 + \frac{y}{c_2}}{2} = (1 - p_x)(1 - p_y) + p_x p_y$    (3)

NXOR gates are used to perform stochastic multiplications.

3) Stochastic adder

Let $x \in [-c_1, c_1]$ and $y \in [-c_2, c_2]$. Then

$p_z = \frac{\frac{x + y}{c_1 + c_2} + 1}{2} = \frac{c_1}{c_1 + c_2} p_x + \frac{c_2}{c_1 + c_2} p_y$    (4)

When $c_1 = c_2 = c$, we get [3]:

$p_z = \frac{1}{2} p_x + \frac{1}{2} p_y$    (5)

Eq. (5) is the special case of Eq. (4) in which the input numbers have the same dynamic range. A multiplexer (MUX) can therefore perform a scaled stochastic addition when its select input is a random stream with probability 0.5, as shown in Fig. 2.

[Fig. 2: MUX-based adder. Input 0: p_x = 0.5, stream (1,0,1,0,1,1,0,0). Input 1: p_y = 0.25, stream (0,0,0,1,0,0,0,1). Select: S = 0.5, stream (0,0,1,0,1,0,1,1). Output: stream (1,0,0,0,0,1,0,1), p_z = (1/2)p_x + (1/2)p_y = 0.375.]

Fig. 2. The stochastic adder.

III. PARALLEL SCHEME FOR STOCHASTIC DCT

The stochastic adder is the bottleneck of stochastic computation in terms of both hardware area and algorithm complexity. This differs from traditional DCT optimization, which focuses on the multiplication operations. Existing stochastic designs usually ignore the forward and reverse conversion circuits between the traditional number system and the stochastic number representation, and such designs have a high area overhead. We therefore propose the following stochastic adders to reduce the area of the original adder as well as of the conversion structure.

A. OR-adder and OR-AND-adder

Inspired by the stochastic NXOR gate [3], we propose the OR-adder, which uses an OR gate to compute stochastic addition:

$p_z = p_x + p_y - p_x \cap p_y$    (6)

Eq. (6) gives the function of an OR gate, where $p_x \cap p_y$ is the fraction of bit positions in which both streams carry a “1”. It does not realize addition unless $p_x \cap p_y = 0$. Consequently, we design the input sequences according to the following rules (Fig. 3(a) shows an example of the OR-adder):
• Both input bit streams are ordered sequences, which means that the “1” bits are either all at the beginning or all at the end of the sequence. The ordered sequence can be generated by the T2P structure in Fig. 4. For example, with an ascending counter the output bit stream is {1,1,0,0,0,0,0,0}, whereas with a descending counter the output is {0,0,0,0,0,0,1,1}.
• The value of both sequences is scaled to be less than 0.5, so overflow cannot occur in the OR-adder.

The performance comparison is shown in Table I. The proposed stochastic adder has less area and a shorter critical path. The ordered sequence also reduces the T2P overhead, since a traditional T2P uses an interleaver or the Golden code to generate the unordered bit streams [7]. Here we analyze the performance of our ordered sequences mathematically. We use uniform quantization to analyze the errors caused by sequence conversion in our design. Assume the range of a number is within [0,1] and L is the length of the bit streams, so the quantization step size is $\Delta = 1/L$. Let $p_x$ be the input signal of the stochastic quantizer (sequence generator), and $Q(p_x)$ be the output signal of the sequence generator.
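The ordered-sequence rule and the OR-adder of Eq. (6) can be illustrated with the values of Fig. 3(a). The sketch below is hedged: `ordered_stream`, `value`, and `or_adder` are names chosen here, and rounding to the nearest multiple of 1/L is an assumed quantization rule rather than one stated in the paper.

```python
def ordered_stream(p, length, ones_first=True):
    """Quantise p to Q(p) = k / length and emit an ordered stream.

    ones_first=True puts the k ones at the start (the ascending-counter
    case in the text); False puts them at the end (descending counter).
    Assumption: k is p * length rounded to the nearest integer.
    """
    k = round(p * length)
    ones, zeros = [1] * k, [0] * (length - k)
    return ones + zeros if ones_first else zeros + ones


def value(bits):
    """Probability represented by a bit stream."""
    return sum(bits) / len(bits)


def or_adder(xs, ys):
    """Bitwise OR of two equally long streams."""
    return [a | b for a, b in zip(xs, ys)]


L = 8
x = ordered_stream(0.25, L, ones_first=True)   # ones at the start
y = ordered_stream(0.5, L, ones_first=False)   # ones at the end
z = or_adder(x, y)
print(x, y, z, value(z))

# If both streams keep their ones at the same end they overlap,
# and the OR gate no longer returns the sum.
overlapping = or_adder(x, ordered_stream(0.5, L, ones_first=True))
print(overlapping, value(overlapping))

# Largest conversion error of the generator over a grid of inputs.
delta = 1 / L
worst = max(abs(i / 1000 - value(ordered_stream(i / 1000, L)))
            for i in range(1001))
print(worst <= delta / 2)
```

Placing the ones of the two operands at opposite ends is what keeps the overlap term of Eq. (6) at zero as long as the two values do not sum to more than 1.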
The quantization error is $e_q = p_x - Q(p_x)$, and the quantization noise is

$\sigma_q^2 = E\{[p_x - Q(p_x)]^2\} = L \cdot \frac{1}{12 L^3} = \frac{\Delta^2}{12}$    (7)

Therefore, the quantization signal-to-noise ratio of stochastic logic is the same as that of the TCS quantizer. With $L = 2^{N-1}$, where N is the bit-width,

$(S/N)_q \approx 6N = 6(\log_2 L + 1)$ (dB)    (8)

In some cases the input bit streams cannot satisfy the design rules and overflow occurs. We therefore design the OR-AND-adder shown in Fig. 3(b) to handle the overflow. The AND gate provides the value lost by the OR operation. In an OR-AND-adder a general stochastic sequence can be used, because an ordered sequence is no longer required.

TABLE I. COMPARISON OF DIFFERENT COMPUTATION ELEMENTS

| Computation element | Chip area (µm²) | Critical path | Clock cycles |
|---|---|---|---|
| Serial stochastic adder | 4.2 | MUX | 2^(N-1) |
| Serial stochastic multiplier | 2.8 | 1 gate | 2^(N-1) |
| Parallel stochastic adder | 4.2 × 2^(N-1) | MUX | 1 |
| Proposed parallel stochastic adder | 2.8 × 2^(N-1) | 1 gate | 1 |
| Parallel stochastic multiplier | 2.8 × 2^(N-1) | 1 gate | 1 |
| TCS adder | 5N | 2N gates | 1 |
| TCS multiplier | 5N² | 2N units | 1 |

The TCS adder is a carry-ripple addition circuit and the TCS multiplier is the parallel carry-ripple array multiplier in [11]; N is the bit-width. We use an ASIC platform, Design Compiler (DC), and TSMC CMOS 130 nm technology.

[Fig. 3(a), OR-adder: p_x = 0.25, stream (1,1,0,0,0,0,0,0); p_y = 0.5, stream (0,0,0,0,1,1,1,1); OR output (1,1,0,0,1,1,1,1), p_z = p_x + p_y = 0.75.]
[Fig. 3(b), OR-AND-adder: p_x = 0.375, stream (1,1,0,0,0,0,1,0); p_y = 0.75, stream (1,0,0,1,1,1,1,1); OR output (1,1,0,1,1,1,1,1), p_z = 0.875; AND output (1,0,0,0,0,0,1,0), p_z' = 0.25.]

Fig. 3. The proposed stochastic adders: (a) OR-adder; (b) OR-AND-adder.

B. Parallel stochastic DCT

We derive the DCT algorithm as:

$$
\begin{bmatrix} X(0) \\ X(2) \\ X(4) \\ X(6) \\ X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix}
=
\begin{bmatrix}
C_4 & C_4 & C_4 & C_4 & 0 & 0 & 0 & 0 \\
C_2 & C_6 & -C_6 & -C_2 & 0 & 0 & 0 & 0 \\
C_4 & -C_4 & -C_4 & C_4 & 0 & 0 & 0 & 0 \\
C_6 & -C_2 & C_2 & -C_6 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & C_1 & C_3 & C_5 & C_7 \\
0 & 0 & 0 & 0 & C_3 & -C_7 & -C_1 & -C_5 \\
0 & 0 & 0 & 0 & C_5 & -C_1 & C_7 & C_3 \\
0 & 0 & 0 & 0 & C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
\begin{bmatrix} x(0) + x(7) \\ x(1) + x(6) \\ x(2) + x(5) \\ x(3) + x(4) \\ x(0) - x(7) \\ x(1) - x(6) \\ x(2) - x(5) \\ x(3) - x(4) \end{bmatrix}
\qquad (9)
$$

where $C_k = \cos(k\pi/16)$ up to a normalization constant.

Unlike [8], Eq. (9) does not reduce the number of multiplications. The multiplier is the bottleneck of the traditional DCT computation, but in stochastic logic a multiplier can be realized by a very simple gate, so implementing the parallel DCT of Eq. (9) in stochastic logic does not incur a high hardware cost. Instead, the addition in the stochastic number representation becomes the bottleneck. In the DCT implementation, we employ our proposed OR-adder and OR-AND-adder to reduce the area of the additions in the parallel structure. The algorithm can be divided into eight independent units. The structure of each unit is shown in Fig. 4. We use an OR-adder as the Type-A stochastic adder, since its input sequences have been generated in ordered form, and an OR-AND-adder as the Type-B stochastic adder, since a randomly ordered sequence is required to keep the stochastic multipliers accurate [3]. We use the unordered coefficient sequence as one operand of the multiplier to guarantee its performance, so the output of the stochastic multiplier is a random sequence. Part C in Fig. 4 is the final processing stage, which adds the overflowed data to the stochastic computation results in the TCS domain to avoid calculation errors. The proposed structure C in Fig. 4 thus realizes the P2T conversion and also compensates for the performance loss.

[Fig. 4: unit for X(2) with parts A, B, and C. Labels: T2P converter (up-down counter, bit stream); inputs x(0) + x(7), x(1) + x(6), x(2) + x(5), x(3) + x(4); coefficients C2, C6, C6, C2; counter1, counter2; P2T converter; output X(2).]

Fig. 4. The basic unit of the parallel stochastic DCT.

IV. IMPLEMENTATION AND DISCUSSION

We used MATLAB 2011 as the simulation platform for the performance measurements. The results are shown in Fig. 5. The benchmark signal for the signal-to-noise ratio (SNR) is the floating-point TCS DCT result. The absolute value of the deviation between the floating-point DCT result and the result of the fixed-point TCS DCT, the serial stochastic DCT (SDCT), and the parallel stochastic DCT (PDCT) is taken as the noise for DCT-SNR, SDCT-SNR, and PDCT-SNR, respectively. Performance analysis shows that the proposed parallel stochastic DCT stays within a ±5% performance difference of the traditional TCS DCT. We also implemented the serial stochastic DCT, the parallel stochastic DCT, and the TCS DCT on an ASIC platform with Design Compiler (DC) using TSMC CMOS 130 nm technology. The synthesis results are shown in Table II.

TABLE II. COMPARISON OF THREE DIFFERENT DCT SCHEMES

| Bit-width (sequence length) | Scheme* | Chip area (µm²) | Delay | AT (10³) | AT² (10⁶) |
|---|---|---|---|---|---|
| 8 (128) | SDCT | 2599 | 204.8 | 532 | 108 |
| 8 (128) | PDCT | 13824 | 0.25# | 3.4 | 0.0008 |
| 8 (128) | DCT | 19010 | 11.6 | 220 | 2.5 |
| 10 (512) | SDCT | 2599 | 819.2 | 2129 | 1743 |
| 10 (512) | PDCT | 55296 | 0.25 | 13.8 | 0.0035 |
| 10 (512) | DCT | 25004 | 15.8 | 395 | 6.2 |
| 12 (2048) | SDCT | 2599 | 3276.8 | 8516 | 27892 |
| 12 (2048) | PDCT | 221184 | 0.25 | 55.2 | 0.013 |
| 12 (2048) | DCT | 31073.5 | 19.55 | 607 | 11.8 |

* SDCT: serial stochastic DCT in [9]; PDCT: parallel stochastic DCT; DCT: DCT in TCS. Delay is calculated from the clock cycle and the critical path. The correspondence between the bit-width in TCS and the sequence length in the stochastic number system is $L = 2^{N-1}$ for a number with one sign bit, where L is the sequence length and N is the fixed-point bit-width.
# Value obtained from synthesis with Design Compiler (DC) using the TSMC CMOS 130 nm logic library.
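The AT and AT² rows can be recomputed from the area and delay rows, taking AT as area × delay and AT² as area × delay². A small sketch, where the function name `area_delay_products` is chosen here and the numbers are copied from the 12-bit entries of Table II (the delay unit is whatever the table uses):

```python
def area_delay_products(area, delay):
    """Return (AT, AT^2) = (area * delay, area * delay**2)."""
    return area * delay, area * delay ** 2


# (area, delay) pairs copied from Table II, bit-width 12 (L = 2048).
table_ii_12bit = {
    "SDCT": (2599, 3276.8),
    "PDCT": (221184, 0.25),
    "DCT": (31073.5, 19.55),
}

at = {name: area_delay_products(a, t)[0]
      for name, (a, t) in table_ii_12bit.items()}
for name, product in at.items():
    print(f"{name}: AT = {product / 1e3:.1f} x 10^3")

# How much larger the TCS DCT's area-delay product is than the PDCT's.
print("DCT / PDCT AT ratio:", round(at["DCT"] / at["PDCT"], 1))
```

The 12-bit case is the least favourable one for the parallel design in the table, since its area grows with the sequence length while its delay stays fixed.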

[Fig. 5: plot of Performance (dB), axis from 10 to 40, against Bit-Width (bit), axis from 7 to 11, with curves for the Serial Stochastic DCT, the Parallel Stochastic DCT, and the Traditional DCT.]

Fig. 5. Performance comparison of different DCT implementations. Here SDCT stands for computing the stochastic DCT in serial, while PDCT stands for computing the stochastic DCT in parallel.

TABLE III. RESULTS OF DIFFERENT FPGA IMPLEMENTATIONS

| Implementation scheme | Single LUT DCT [10] | 2 Parallel LUT DCT [10] | Serial Stochastic DCT [9], without T2P, P2T | Proposed, with T2P, P2T (L = 32) |
|---|---|---|---|---|
| LUT | 1175 | 1525 | 406 | 2264 |
| FPGA chip | Xilinx XC3S100E | Xilinx XC3S100E | Xilinx XC3S100E | Xilinx XC6vlx75t |
| Max. frequency | 50.35 MHz | 55.5 MHz | 108.9 MHz | 450.1 MHz |

Although the SDCT uses very few resources compared with the TCS DCT (only 10% of the hardware cost), its system delay is proportional to the length of the serial sequence. We use AT (Area × Time) and AT² (Area × Time²) to compare hardware cost and system throughput together. In terms of area-delay performance (AT or AT²), the parallel stochastic DCT in Table II is at least 10 times more efficient than the traditional DCT and the serial stochastic DCT. The FPGA results are shown in Table III. The proposed structure, which includes the T2P and P2T converters, trades area for a high operating frequency of 450 MHz.

V. CONCLUSIONS

In this paper, we proposed the design of two basic stochastic units: an OR-adder and an OR-AND-adder. Such parallel architectures can replace the conventional multiplier and significantly save chip area. We also implemented the parallel stochastic DCT design, which can meet the performance requirements of real-time image processing (up to 450 MHz) while using less area.

REFERENCES

[1] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform,” IEEE Trans. Comput., vol. 23, pp. 90-93, Jan. 1974.
[2] B. R. Gaines, “Stochastic Computing Systems,” Advances in Information Systems Science, J. F. Tou, ed., vol. 2, chapter 2, pp. 37-172, New York: Plenum, 1969.
[3] B. D. Brown and H. C. Card, “Stochastic neural computation. I. Computational elements,” IEEE Transactions on Computers, vol. 50, no. 9, pp. 891-905, Sep. 2001.
[4] W. Qian and M. D. Riedel, “The synthesis of robust polynomial arithmetic with stochastic logic,” 45th ACM/IEEE Design Automation Conference (DAC 2008), pp. 648-653, 8-13 June 2008.
[5] W. Qian, C. Wang, P. Li, et al., “An efficient implementation of numerical integration using logical computation on stochastic bit streams,” 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 156-162, 5-8 Nov. 2012.
[6] J. Chen and J. Hu, “A Novel FIR Filter Based on Stochastic Logic,” accepted for 2013 IEEE International Symposium on Circuits and Systems (ISCAS).
[7] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, New Jersey: John Wiley & Sons, Inc., 1999.
[8] C. Loeffler et al., “Algorithm-architecture mapping for custom DSP chips,” 1988 IEEE International Symposium on Circuits and Systems, vol. 2, pp. 1953-1956, 7-9 June 1988.
[9] Y. Li and J. Hu, “A novel implementation scheme for high area-efficient DCT based on signed stochastic computation,” accepted for 2013 IEEE International Symposium on Circuits and Systems (ISCAS).
[10] G. Singh, “Multiplier-less Floating Point 1D DCT Implementation,” in TENCON 2008 - 2008 IEEE Region 10 Conference, 19-21 Nov. 2008.
