
Optimization of Stochastic Computing Based Deep Learning Systems with Parallel Finite State Machine Implementation

Jinjie Liu
Department of Electrical Engineering & Computer Science, University of Michigan, Ann Arbor
[email protected]

ABSTRACT
Deep learning has become an increasingly heated topic as artificial intelligence is on the rise. At the same time, hardware restrictions in real applications have driven the investigation of combining another rising technique, stochastic computing (SC), with deep learning systems to achieve low power costs. So far, successfully implemented operations include addition, multiplication, the inner product, and more complicated nonlinear functions such as the hyperbolic tangent (tanh) realized with linear finite state machines (FSM). The inner product implementation realizes convolution, a core operation of neural networks, therefore encouraging SC-based deep learning neural network implementations. Meanwhile, extremely long bitstreams are needed to achieve satisfying accuracy, especially for large-scale deep learning systems, causing latency issues. The integration of parallelism is thus considered in an attempt to alleviate the latency issue. In this paper, an optimization of stochastic computing based deep learning systems is proposed by introducing parallel FSM implementations to replace the serial ones generally used in previous works. By substituting a serial linear FSM with several parallel linear FSMs of the same size yet operating on shorter bitstreams, the parallel FSM trades hardware for processing latency. The accuracy of a sample parallel FSM unit is evaluated against its serial counterpart, before a case study verifies that the replacement sacrifices little accuracy while substantially reducing computation time in actual deep learning system realizations.

CCS Concepts
• Computing Methodologies ➝ Parallel computing methodologies ➝ Parallel algorithms • Computing Methodologies ➝ Machine learning.

Keywords
Parallel FSM; deep learning; stochastic computing; neural network

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICACS'20, January 6–8, 2020, Rabat, Morocco
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7732-4/20/01…$15.00
DOI: https://doi.org/10.1145/3423390.3426727

1. INTRODUCTION
As artificial intelligence gains more and more ground in the field of science and technology, the growth of the field casts much light on deep learning systems. Deep learning has applications in a broad range, including autonomous vehicles, smart appliances, wearable devices, and other mobile devices. However, a decent fraction of these applications poses strict hardware constraints on the implementation of such deep learning systems. The limitations include, but are not limited to, low power consumption and low area costs, contradicting the intrinsic properties of deep learning neural networks, since deep layered structures profoundly contribute to both better learning performance and higher classification accuracy [1].

To significantly save hardware complexity, especially for large deep learning systems, research interest has shifted towards the introduction of stochastic computing (SC), a well-known computing technique that saves both energy and circuit size [2]. More importantly, it happens to be capable of implementing the core computation units incorporated in neural network systems, and its innate error tolerance further justifies its use.

Extensive, and in many cases successful, experiments and research have been conducted to integrate stochastic computing into deep learning systems [1][3][4]. Nonetheless, a simple integration of stochastic computing and deep learning neural networks does not present a flawless solution. It is already recognized that the long bitstreams used for stochastic circuit processing profoundly undermine the efficiency of computation [1].

Meanwhile, Ma and Lilja have proposed an overhaul of the linear FSM, replacing a single serial linear FSM with a set of parallel linear FSMs of the same size, i.e. each copy has the same number of states as the serial FSM, that operate on shorter bitstreams [5]. This substitution is an inspiration drawn from parallelism, and it serves as a compromise between circuit delay and hardware cost.

In this paper, the influence of applying the parallel FSM to stochastic computing based deep learning systems, and its performance with different bitstream lengths, is examined. This paper poses and explores the following questions: first, how accurate are the parallel FSM units compared to the serially implemented ones; second, what impact does parallel FSM integration exert on actual deep learning system applications.

2. STOCHASTIC COMPUTING AND SC-BASED DEEP LEARNING SYSTEM
2.1 Stochastic Computing
In contrast to conventional computation with binary values, stochastic computing deals with binary bitstreams that represent a value in the range [0, 1] by the probability of bits being 1 in the stream. For instance, a bitstream of 01101010 with probability P(X = 1) = 4/8 = 0.5 represents the value 0.5. This is typically known as unipolar encoding, and the range [0, 1] can be further extended to [-1, 1] to satisfy the need of representing negative values with bipolar encoding [6]. The equivalent bipolar value Y represented by the same bitstream can be found from the represented unipolar value X by Y = 2X − 1. Thus, for the example above, the bitstream 01101010 represents 0.5 in unipolar encoding and 2 × 0.5 − 1 = 0 in bipolar encoding.
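To make the two encodings concrete, here is a minimal Python sketch (not taken from the paper or from the code of [11]; the function names are illustrative) that encodes a value as a stochastic bitstream and decodes it under both interpretations:

```python
import random

def to_unipolar_stream(value, length, seed=0):
    """Encode a value in [0, 1] as a bitstream whose probability of 1 equals the value."""
    rng = random.Random(seed)
    return [1 if rng.random() < value else 0 for _ in range(length)]

def from_unipolar(stream):
    """Decode a unipolar bitstream: the fraction of 1s in the stream."""
    return sum(stream) / len(stream)

def to_bipolar_stream(value, length, seed=0):
    """Encode a value in [-1, 1]; the underlying unipolar value is (value + 1) / 2."""
    return to_unipolar_stream((value + 1) / 2, length, seed)

def from_bipolar(stream):
    """Decode a bipolar bitstream via Y = 2X - 1."""
    return 2 * from_unipolar(stream) - 1

if __name__ == "__main__":
    s = to_unipolar_stream(0.5, 1024)
    print(from_unipolar(s))   # close to 0.5 when read as unipolar
    print(from_bipolar(s))    # close to 0.0 when the same stream is read as bipolar
```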
Multiplication in unipolar and bipolar encoding can be realized by a single AND gate and a single XNOR gate, respectively (Fig. 1) [7].

Figure 1. Multiplication in (a) unipolar and (b) bipolar encoding.

Similarly, addition in both encodings can be realized with a 2-to-1 multiplexer (MUX) whose select line carries a bitstream with probability P(S = 1) = 0.5, as shown in Fig. 2 [2]. However, due to the confined range of both encoding methods, the output is scaled; the scaling can be cancelled out by wire shifting at no additional cost.

Figure 2. Scaled addition.
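A bit-level software sketch of these two elements is shown below, assuming independent input streams of equal length; it is illustrative only, and the helper names are not from the paper:

```python
import random

def rand_stream(p, length, rng):
    """Bitstream with probability p of each bit being 1."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_mul_unipolar(a, b):
    """Unipolar multiplication: bitwise AND, so P(out=1) = P(a=1) * P(b=1)."""
    return [x & y for x, y in zip(a, b)]

def sc_mul_bipolar(a, b):
    """Bipolar multiplication: bitwise XNOR of the two streams."""
    return [1 - (x ^ y) for x, y in zip(a, b)]

def sc_scaled_add(a, b, sel):
    """Scaled addition: 2-to-1 MUX with a select stream of probability 0.5.
    The output encodes (a + b) / 2 in either encoding."""
    return [x if s else y for x, y, s in zip(a, b, sel)]

if __name__ == "__main__":
    rng = random.Random(1)
    L = 4096
    a = rand_stream(0.6, L, rng)               # unipolar 0.6
    b = rand_stream(0.5, L, rng)               # unipolar 0.5
    sel = rand_stream(0.5, L, rng)
    print(sum(sc_mul_unipolar(a, b)) / L)      # roughly 0.30
    print(sum(sc_scaled_add(a, b, sel)) / L)   # roughly 0.55 = (0.6 + 0.5) / 2
```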

More complicated non-linear functions can be realized with linear FSMs of the general form shown in Fig. 3 [7]. For instance, the linear FSM corresponding to the hyperbolic tangent function (tanh) is shown in Fig. 4 [7].

Figure 3. General linear FSM state diagram structure.

Figure 4. Linear FSM for the tanh function state diagram.
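A minimal software model of such a linear tanh FSM (the Stanh element of [7]) is sketched below. It assumes the standard saturating up/down counter with N states whose output is 1 while the state is in the upper half, following the general structure of Fig. 3 and Fig. 4; names are illustrative:

```python
import math
import random

def stanh(bits, n_states=8, init_state=0):
    """Linear FSM approximating tanh: an input bit of 1 moves the state up,
    a 0 moves it down (saturating at both ends); the output bit is 1 while
    the state is in the upper half.  For a bipolar input x this approximates
    tanh((n_states / 2) * x)."""
    state, out = init_state, []
    for b in bits:
        state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        out.append(1 if state >= n_states // 2 else 0)
    return out

def bipolar_stream(x, length, rng):
    """Bipolar encoding of x in [-1, 1]: P(bit = 1) = (x + 1) / 2."""
    return [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(length)]

if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.4
    bits = bipolar_stream(x, 1024, rng)
    y = 2 * sum(stanh(bits, 8)) / len(bits) - 1
    print(y, math.tanh(4 * x))   # the two values should be close
```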


Multiple reasons support the use of stochastic computing. First, its unique properties enable energy-efficient computation frameworks: stochastic computing has already proved its strength in hardware complexity with successful applications in image processing [8]. Second, stochastic computing exhibits extraordinary error tolerance [2], a property of great use when it comes to neural network training and classification [1]. Moreover, the majority of the core calculation units in neural networks have counterparts in stochastic computing. For instance, the convolution operation in convolutional neural networks (ConvNN) can be realized with a combination of the stochastic multiplication and addition elements. The activation functions used in ConvNN, for example rectified linear units (ReLU) and the sigmoid function [1], can be replaced by the stochastic computing implementation of the hyperbolic tangent (tanh) discussed above [7].

2.2 SC-Based Deep Learning System
It is only recently that research efforts have turned to stochastic computing based deep learning systems, since both fields started to gain attention within the last decade or so. Yet Ren et al. have already proposed a reconfigurable large-scale SC-based deep learning framework [1]. A reconfigurable neuron diagram is proposed, in an attempt to replace as many arithmetic operation units with stochastic computing elements as possible, as shown in Fig. 5 [1].

Figure 5. Reconfigurable neuron diagram.
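To make the composition concrete, the sketch below chains these stochastic elements into a toy neuron: XNOR multipliers for the products, a multiplexer for the scaled sum, and the tanh FSM as activation. It is a simplified stand-in under these assumptions, not the actual reconfigurable neuron of [1]:

```python
import math
import random

def bipolar_stream(x, length, rng):
    """Bipolar encoding of x in [-1, 1]: P(bit = 1) = (x + 1) / 2."""
    return [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(length)]

def stanh(bits, n_states):
    """Linear tanh FSM (see Section 2.1): saturating up/down counter."""
    state, out = 0, []
    for b in bits:
        state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        out.append(1 if state >= n_states // 2 else 0)
    return out

def sc_neuron(xs, ws, length=4096, n_states=8, seed=0):
    """Toy SC neuron: XNOR products, MUX-based scaled sum, tanh FSM activation.
    Returns the decoded bipolar output, roughly tanh((n_states/2) * mean(x_i * w_i))."""
    rng = random.Random(seed)
    x_streams = [bipolar_stream(x, length, rng) for x in xs]
    w_streams = [bipolar_stream(w, length, rng) for w in ws]
    # bipolar multiplication: XNOR of each input stream with its weight stream
    products = [[1 - (a ^ b) for a, b in zip(xs_, ws_)]
                for xs_, ws_ in zip(x_streams, w_streams)]
    # scaled addition: at each clock the MUX picks one product stream at random
    summed = [products[rng.randrange(len(products))][t] for t in range(length)]
    activated = stanh(summed, n_states)
    return 2 * sum(activated) / length - 1

if __name__ == "__main__":
    xs, ws = [0.3, -0.5, 0.8], [0.6, 0.2, -0.4]
    inner = sum(x * w for x, w in zip(xs, ws)) / len(xs)   # scaled inner product
    print(sc_neuron(xs, ws), math.tanh(4 * inner))         # should be comparable
```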

3. PARALLEL VS. SERIAL LINEAR FSM AND INTEGRATION INTO DEEP LEARNING SYSTEM
3.1 Parallel vs. Serial Linear FSM
The parallel implementation of the linear FSM was first proposed by Ma and Lilja [5]. The idea originates from the concern that a single serial linear FSM operating on long bitstreams spends part of the stream just reaching its operating state: for instance, a 16-state FSM accepting bitstreams of length 1024 and initialized to state 0 needs 16 clock cycles to step from state 0 to state 15 for an input value close to 1, which incurs a 16/1024 = 1.56% error. In contrast, in the parallel FSM implementation proposed by Ma and Lilja, 32 parallel copies of the 16-state linear FSM are created for the operation [5]. The 1024-bit stream is equally divided into 32-bit streams, one for each copy of the FSM [5]. As for the initial states, a look-up table (LUT) is created beforehand, storing a series of vectors that indicate how many copies of the FSM should start in each state, and a majority gate acting as the estimator draws the correct vector out of the LUT [5]. Such vectors are called steady-state distribution vectors and are determined by the input value [9]. For instance, with an input value of 0.5, the initial states are evenly distributed, resulting in 2 FSMs starting in each state for the 32 parallel FSMs. With an input of 0, on the other hand, all 32 FSMs start in state 0. In this way, the implementation eschews the initialization issue and outperforms the serial FSM implementation, especially for inputs closer to 1.
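A simplified software model of this scheme is sketched below, under two stated assumptions that differ from [5]: the steady-state distribution vector is computed analytically from the estimated input probability (for this saturating counter it is proportional to (p/(1-p))^i), standing in for the precomputed LUT and majority-gate estimator, and the per-copy outputs are simply concatenated; names are illustrative:

```python
import math
import random

def stanh_from(bits, n_states, init_state):
    """Linear tanh FSM run from a given initial state (see Section 2.1)."""
    state, out = init_state, []
    for b in bits:
        state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        out.append(1 if state >= n_states // 2 else 0)
    return out

def steady_state_counts(p, n_states, n_copies):
    """Steady-state distribution vector of the saturating counter:
    pi_i is proportional to (p / (1 - p))**i.  Returns how many of the
    n_copies FSMs should start in each state."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    r = p / (1 - p)
    weights = [r ** i for i in range(n_states)]
    total = sum(weights)
    counts = [round(n_copies * w / total) for w in weights]
    # fix rounding so the counts add up to n_copies
    while sum(counts) < n_copies:
        counts[counts.index(max(counts))] += 1
    while sum(counts) > n_copies:
        counts[counts.index(max(counts))] -= 1
    return counts

def parallel_stanh(bits, n_states=16, n_copies=32):
    """Split the stream among n_copies FSMs initialised from the
    steady-state distribution, then concatenate their outputs."""
    chunk = len(bits) // n_copies
    p_est = sum(bits) / len(bits)            # stands in for the LUT estimator of [5]
    counts = steady_state_counts(p_est, n_states, n_copies)
    init_states = [s for s, c in enumerate(counts) for _ in range(c)]
    out = []
    for k, s0 in enumerate(init_states):
        out += stanh_from(bits[k * chunk:(k + 1) * chunk], n_states, s0)
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.6                                  # bipolar input value
    bits = [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(1024)]
    y = 2 * sum(parallel_stanh(bits)) / 1024 - 1
    print(y, math.tanh(8 * x))               # a 16-state FSM approximates tanh(8x)
```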
To illustrate the effect, a simulation of both the serial and the parallel implementation is conducted on the stochastic tanh function in Fig. 6 and Fig. 7, and the mean squared errors (MSE) are calculated for an 8-state and a 16-state FSM example in Table 1. For statistical significance, each implementation is run 10 times for each input value.

Figure 6. Tanh(4, x) of parallel vs. serial linear FSM (8-state).

Figure 7. Tanh(8, x) of parallel vs. serial linear FSM (16-state).

Table 1. MSE of parallel vs. serial linear FSM simulation

MSE        Serial    Parallel
8-state    0.003     0.004
16-state   0.002     0.003

It can be concluded that the serial FSM implementation is slightly more accurate overall (Table 1), while the parallel implementation does demonstrate its strength when the input values are closer to one. Besides, for the 16-state FSM example, the parallel FSM requires a total of only 32 clock cycles to execute, while the serial one needs 1024 cycles. The reduction in delay becomes more significant as the bitstream length increases.

3.2 Integration of Parallel FSM into the Deep Learning Framework
The overall structure of the neuron remains unchanged, as shown in Fig. 5, while all the tanh blocks are replaced with the aforementioned parallel FSM. The actual configuration of the parallel implementation, i.e. the number of copies of parallel FSMs, the total number of steady-state distribution vectors, etc., varies from case to case. It is worth noting that under the same bitstream length, different parameters in the parallel FSM implementation may lead to slightly different results. In particular, the number of states in each copy of the parallel FSM may differ from that of the serial FSM, in which case the actual outputs may exhibit higher accuracy.

Figure 8. 16-state parallel tanh FSM vs. tanh(16, x).

For example, for the 16-state parallel vs. serial FSM simulation above, the parallel implementation actually resembles the function tanh(32/2, x) = tanh(16, x), which would require a serial FSM with a total of 32 states; against tanh(16, x) the parallel implementation achieves a better MSE of 0.001 (Fig. 8).
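For reference, this relies on the gain relation of the stochastic tanh FSM from [7], restated here in the paper's tanh(k, x) notation, where an N-state FSM corresponds to k = N/2:

```latex
% Gain relation of the stochastic tanh FSM (Brown & Card [7]):
% an N-state linear FSM driven by a bipolar stream encoding x satisfies
\[
  \mathrm{Stanh}(N, x) \;\approx\; \tanh\!\left(\frac{N}{2}\,x\right),
  \qquad\text{e.g.}\qquad
  \mathrm{Stanh}(32, x) \;\approx\; \tanh(16\,x).
\]
```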
4. CASE STUDY: MNIST DIGIT CLASSIFICATION WITH SC-BASED DEEP LEARNING NEURAL NETWORK AND PARALLEL FSM IMPLEMENTATION
In this section, a handwritten digit classification problem is solved with the MNIST database [10]. The database comprises 60,000 sample images in the training set and 10,000 samples in the testing group. Since the purpose of the simulation is not to verify the robustness of the SC based neural network, but to study the influence of integrating the parallel FSM, the system was trained only to an acceptable accuracy. The code for the software simulation is developed based on [11]; all stochastic linear FSM modules are rewritten to simulate parallel FSM implementations.

A side-by-side comparison of parallel vs. serial FSM based network training, in terms of output layer weights and classification accuracy against training epochs under identical training parameters, is shown in Fig. 9.

Figure 9. Network training results comparison: (a) serial FSM, (b) parallel FSM.

Figure 10. Case study: parallel vs. serial FSM classification accuracy.

The comparison clearly shows the successful derivation of the output layer weights and the ability to reach comparable classification accuracy with parallel FSMs. Even though the two simulations show slight differences, these are within the normal range considering the stochastic nature of neural network training (Fig. 10).

Due to software limitations, only ten tests are done for each bitstream length, so the accuracy percentages can be somewhat circumstantial. However, the general trend still holds: SC based deep learning systems with serial and parallel FSM implementations are on par in terms of classification performance, yet the reduction in time delay brought by parallelism is significant. The actual improvement in delay is not measurable in software simulations, and an implementation of the system on a field programmable gate array (FPGA) board is necessary to quantify the improvement.

5. CONCLUSION
In this paper, the influence of replacing the serial linear FSMs used in stochastic computing based deep learning systems with parallel implementations is examined. The introduction of parallelism mainly aims at alleviating the striking latency issues induced by the long bitstreams used by stochastic computing. Both the individual module comparison in Section 3 and the case study in Section 4 show that the introduction of parallel FSMs yields systems comparable to the original ones, especially for longer bitstreams. This paper is still lacking in that the case study has not been implemented on an actual FPGA board to time the difference in latency. Future work may include actual hardware simulations and implementations to quantify the latency improvements brought by such a substitution.

6. ACKNOWLEDGEMENT
This paper would not have been possible without the support of Professor Yiyu Shi from the University of Notre Dame. His well-organized and clearly explained introduction to machine learning and neural networks provided a head start for my implementations of the parallel FSM and the experiments on the deep learning systems. I would also like to express my gratitude towards two teaching assistants, Yunlong Jia and Qi Wang, for their prompt and valuable advice whenever questions arose during the research.

7. REFERENCES
[1] A. Ren, Z. Li et al., "Designing reconfigurable large-scale deep learning systems using stochastic computing," 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, CA, pp. 1-7, 2016.
[2] A. Alaghi and J. P. Hayes, "Survey of stochastic computing," ACM Trans. Embed. Comput. Syst., vol. 12, no. 2s, Article 92, May 2013.
[3] C. Lammie and M. R. Azghadi, "Stochastic Computing for Low-Power and High-Speed Deep Learning on FPGA," 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, pp. 1-5, 2019.
[4] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu and W. J. Gross, "VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2688-2699, Oct. 2017.
[5] C. Ma and D. J. Lilja, "Parallel implementation of finite state machines for reducing the latency of stochastic computing," 2018 19th International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, pp. 335-340, 2018.
[6] B. Gaines, "Stochastic computing systems," Advances in Information Systems Science, vol. 2, no. 2, pp. 37-172, 1969.
[7] B. D. Brown and H. C. Card, "Stochastic neural computation I: computational elements," IEEE Trans. Comput., vol. 50, pp. 891-905, Sept. 2001.
[8] P. Li, D. J. Lilja et al., "Computation on stochastic bit streams: digital image processing case studies," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 449-462, 2014.
[9] A. A. Markov, "Extension of the limit theorems of probability theory to a sum of variables connected in a chain," reprinted in Appendix B of R. Howard, Dynamic Probabilistic Systems, Volume 1: Markov Chains, John Wiley and Sons, 1971.
[10] Y. LeCun, C. Cortes, and C. J. C. Burges, The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist.
[11] A. E. Solomou, "adamsolomou/SC-DNN," GitHub, 2020. [Online]. Available: https://github.com/adamsolomou/SC-DNN.

