above, bitstream 01101010 represents both 0.5 in unipolar encoding and 2 × 0.5 − 1 = 0 in bipolar encoding. Due to the range of both encoding methods, the output is scaled, yet the scaling can be cancelled out by wire shifting at no additional cost. Multiplication in unipolar and bipolar encoding can be realized by a single AND gate and a single XNOR gate, respectively (Fig. 1) [7].
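As a concrete illustration of these gate-level operations, the following minimal NumPy sketch (an illustration of ours, not code from the paper; the encode helper names and stream length are assumptions) multiplies two unipolar streams with a bitwise AND and two bipolar streams with a bitwise XNOR:

import numpy as np

rng = np.random.default_rng(0)
N = 4096  # bitstream length; longer streams give lower variance

def encode_unipolar(x, n):
    # Unipolar coding: P(bit = 1) = x, for x in [0, 1]
    return (rng.random(n) < x).astype(np.uint8)

def encode_bipolar(x, n):
    # Bipolar coding: P(bit = 1) = (x + 1) / 2, for x in [-1, 1]
    return (rng.random(n) < (x + 1) / 2).astype(np.uint8)

# Unipolar multiplication with a bitwise AND: mean of (a & b) approximates 0.5 * 0.8
a, b = encode_unipolar(0.5, N), encode_unipolar(0.8, N)
print((a & b).mean())            # close to 0.4

# Bipolar multiplication with a bitwise XNOR: decodes to roughly 0.5 * (-0.6)
x, y = encode_bipolar(0.5, N), encode_bipolar(-0.6, N)
xnor = 1 - (x ^ y)               # XNOR = NOT XOR
print(2 * xnor.mean() - 1)       # close to -0.3

The decoded products are only estimates; their variance shrinks as the bitstream length grows, which is exactly why long bitstreams, and hence long latencies, appear in the first place.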
3. PARALLEL VS. SERIAL LINEAR FSM AND INTEGRATION INTO DEEP LEARNING SYSTEM
3.1 Parallel vs. Serial Linear FSM
The parallel implementation of the linear FSM was first proposed by Ma and Lilja [5]. The idea originates from a concern about a single serial linear FSM operating on long bitstreams: for instance, a 16-state FSM accepting bitstreams of length 1024 and initialized to state 0 needs a total of 16 clock cycles to step from state 0 to state 15 for an input value close to 1, which incurs a 16/1024 = 1.56% error. In contrast, in the parallel FSM implementation proposed by Ma and Lilja, 32 parallel copies of the 16-state linear FSM are created for the operation [5]. The 1024-bit stream is divided equally into 32-bit sub-streams, one for each copy of the FSM [5]. In terms of initial states, a look-up table (LUT) is created beforehand, storing a series of vectors that indicate the number of FSM copies to start in each state, and a majority gate acting as the estimator draws the correct vector out of the LUT [5]. Such vectors are called steady-state distribution vectors and are determined by the input value [9]. For instance, with an input value of 0.5, the initial states are evenly distributed, resulting in 2 FSMs starting in each state for the 32 parallel FSMs. With an input of 0, on the other hand, all 32 FSMs start at state 0. In this way, the implementation eschews the initialization issue and outperforms the serial FSM implementation, especially for inputs closer to 1.
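To make the mechanism concrete, the following Python sketch (a simplified illustration, not the actual code of [5] or [11]; the function names, the plain concatenation of sub-stream outputs, and the hand-built distribution vector are assumptions) shows a 16-state serial stochastic tanh FSM and a 32-copy parallel version whose initial states follow a steady-state distribution vector:

import numpy as np

def stanh_serial(bits, n_states=16, init_state=0):
    # Serial stochastic tanh FSM [7]: a saturating up/down counter that
    # emits 1 while the state sits in the upper half of the chain.
    state = init_state
    out = np.empty_like(bits)
    for i, b in enumerate(bits):
        state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        out[i] = 1 if state >= n_states // 2 else 0
    return out

def stanh_parallel(bits, n_states=16, n_copies=32, init_states=None):
    # Parallel linear FSM in the spirit of [5] (simplified): split the stream
    # into n_copies equal sub-streams, run one FSM copy per sub-stream, and
    # start each copy in the state given by the distribution vector.
    if init_states is None:
        init_states = np.zeros(n_copies, dtype=int)
    subs = np.array_split(bits, n_copies)
    outs = [stanh_serial(s, n_states, int(s0)) for s, s0 in zip(subs, init_states)]
    return np.concatenate(outs)

# Example: input probability 0.5 (bipolar value 0) on a 1024-bit stream.
rng = np.random.default_rng(1)
bits = (rng.random(1024) < 0.5).astype(np.uint8)
# Uniform steady-state distribution for this input: 2 copies start in each state.
init = np.repeat(np.arange(16), 2)
print(2 * stanh_serial(bits).mean() - 1)                      # near tanh(8 * 0) = 0
print(2 * stanh_parallel(bits, init_states=init).mean() - 1)  # near 0 as well

Both variants consume exactly the same 1024 input bits; the parallel one simply trades one long walk through the state chain for 32 short, well-initialized walks of 32 steps each, which is where the latency reduction comes from.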
To illustrate the effect, a simulation of both the serial and the parallel implementation is conducted on the stochastic tanh function in Fig. 6 and Fig. 7, and the mean squared errors (MSE) are calculated for both an 8-state and a 16-state FSM example in Table 1. For statistical significance, each implementation is run 10 times for each input value.

Figure 7. Tanh(8, x) of parallel vs. serial linear FSM (16-state).

It can be concluded that the serial FSM implementation outperforms the parallel one in the long run, while the parallel implementation does demonstrate its strength when the input values are closer to one. Besides, for the 16-state FSM, for example, the parallel FSM requires a total of 32 clock cycles to execute, while the serial one needs 1024 cycles. The reduction in delay becomes more significant as bitstream length increases.
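The MSE comparison described above can be scripted along these lines (again a sketch: it reuses the hypothetical stanh_serial and stanh_parallel helpers from the previous listing, takes tanh(n_states/2 · x) as the reference, and runs ten trials per input value):

import numpy as np

def mse_vs_tanh(stoch_tanh, inputs, n_states=16, length=1024, trials=10, seed=0):
    # stoch_tanh maps a 0/1 input bitstream to a 0/1 output bitstream,
    # e.g. the stanh_serial or stanh_parallel sketch above.
    rng = np.random.default_rng(seed)
    errors = []
    for x in inputs:                          # bipolar input value in [-1, 1]
        reference = np.tanh(n_states / 2 * x)
        for _ in range(trials):               # 10 runs per input value
            bits = (rng.random(length) < (x + 1) / 2).astype(np.uint8)
            estimate = 2 * stoch_tanh(bits).mean() - 1
            errors.append((estimate - reference) ** 2)
    return float(np.mean(errors))

# xs = np.linspace(-1, 1, 21)
# print(mse_vs_tanh(stanh_serial, xs), mse_vs_tanh(stanh_parallel, xs))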
3.2 Integration of Parallel FSM into Deep Learning Framework
The overall structure of the neuron remains unchanged, as shown in Fig. 5, while all the tanh blocks are replaced with the aforementioned parallel FSM. The actual configuration of the parallel implementation, i.e., the number of parallel FSM copies, the total number of steady-state distribution vectors, etc., varies from case to case. It is worth noting that under the same bitstream length, different parameters of the parallel FSM implementation may lead to slightly different results. The total number of states for each copy of the parallel FSM may differ from that of the serial FSM, in which case the actual outputs may exhibit higher accuracy; for instance, an FSM with a total of 32 states performed better, with an MSE of 0.001 (Fig. 8).
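In common SC neuron designs such as [1], inputs and weights are multiplied with XNOR gates, summed with a multiplexer-based scaled adder, and passed through the FSM tanh activation. The sketch below (an illustration of ours; the pluggable activation argument is an assumption, not the exact structure of Fig. 5) shows how the activation module can be swapped between the serial and the parallel FSM without touching the rest of the neuron:

import numpy as np

def sc_neuron(x_streams, w_streams, activation, rng):
    # x_streams, w_streams: uint8 arrays of shape (n_inputs, length), bipolar coded.
    products = 1 - (x_streams ^ w_streams)        # XNOR: bipolar multiplication
    n_inputs, length = products.shape
    select = rng.integers(0, n_inputs, size=length)
    summed = products[select, np.arange(length)]  # MUX: scaled addition (sum / n_inputs)
    return activation(summed)                     # e.g. stanh_serial or stanh_parallel

# out_serial   = sc_neuron(x_bits, w_bits, stanh_serial, rng)
# out_parallel = sc_neuron(x_bits, w_bits, stanh_parallel, rng)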
4. CASE STUDY: MNIST DIGIT CLASSIFICATION WITH SC-BASED DEEP LEARNING NEURAL NETWORK AND PARALLEL FSM IMPLEMENTATION
In this section, a handwritten digit classification problem is solved with the MNIST database [10]. The database comprises 60,000 sample images in the training set and 10,000 samples in the testing group. Since the simulation is meant to verify not the robustness of the SC-based neural network but the influence of integrating the parallel FSM, the system was trained only to an acceptable accuracy. The code for the software simulation is developed based on [11], and all stochastic linear FSM modules are rewritten to simulate parallel FSM implementations.

A side-by-side comparison of parallel vs. serial FSM based network training, covering output-layer weights and classification accuracy against training epochs under identical training parameters, is shown in Fig. 9.

Figure 9. Network training results comparison: (a) serial FSM, (b) parallel FSM.
Figure 10. Case study: parallel vs. serial FSM classification accuracy.
The comparison clearly shows the successful derivation of output layer weights and the ability to reach comparable classification accuracy with parallel FSMs. Even though the two simulations show slight differences, these lie within the normal range considering the nature of neural network training (Fig. 10).

Due to software limitations, only ten tests are done for each bitstream length, so the accuracy percentages can be somewhat circumstantial. However, the general trend still holds: SC-based deep learning systems with serial and parallel FSM implementations are on par in terms of classification performance, yet the reduction in time delay brought by parallelism is significant. The actual improvement in delay is not measurable in software simulations, and an implementation of the system on a field-programmable gate array (FPGA) board is necessary to quantify the improvement.
5. CONCLUSION
In this paper, the influence of a parallel implementation of the linear FSMs used in stochastic-computing-based deep learning systems is examined. The introduction of parallelism mainly aims at alleviating the striking latency issues induced by the long bitstreams used in stochastic computing. Both the individual module comparison in Section 3 and the case study in Section 4 show that the system with the parallel FSM is comparable to the original one, especially for longer bitstreams. This paper is still lacking in that the case study has not been implemented on an actual FPGA board to time the difference in latency. Future work may include actual hardware simulations and implementations to quantify the latency improvements of such a substitution.
6. ACKNOWLEDGEMENT
This paper would not have been possible without the support of Professor Yiyu Shi from the University of Notre Dame. His well-organized and clearly explained introduction to machine learning and neural networks provided a head start for my implementation of the parallel FSM and the experiments on the deep learning systems. I would also like to express my gratitude to the two teaching assistants, Yunlong Jia and Qi Wang, for their prompt and valuable advice whenever questions arose during the research.
7. REFERENCES
[1] A. Ren, Z. Li et al., "Designing reconfigurable large-scale deep learning systems using stochastic computing," 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, CA, pp. 1-7, 2016.
[2] A. Alaghi and J. P. Hayes, "Survey of stochastic computing," ACM Trans. Embed. Comput. Syst., vol. 12, no. 92, May 2013.
[3] C. Lammie and M. R. Azghadi, "Stochastic Computing for Low-Power and High-Speed Deep Learning on FPGA," 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, pp. 1-5, 2019.
[4] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu and W. J. Gross, "VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2688-2699, Oct. 2017.
[5] C. Ma and D. J. Lilja, "Parallel implementation of finite state machines for reducing the latency of stochastic computing," 2018 19th International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, pp. 335-340, 2018.
[6] B. Gaines, "Stochastic computing systems," Advances in Information Systems Science, vol. 2, no. 2, pp. 37-172, 1969.
[7] B. D. Brown and H. C. Card, "Stochastic neural computation I: computational elements," IEEE Trans. Comput., vol. 50, pp. 891-905, Sept. 2001.
[8] P. Li, D. J. Lilja et al., "Computation on stochastic bit streams: digital image processing case studies," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 449-462, 2014.
[9] A. A. Markov, "Extension of the limit theorems of probability theory to a sum of variables connected in a chain," reprinted in Appendix B of R. Howard, Dynamic Probabilistic Systems, Volume 1: Markov Chains, John Wiley and Sons, 1971.
[10] Y. LeCun, C. Cortes, and C. J. C. Burges, The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist
[11] A. E. Solomou, "adamsolomou/SC-DNN," GitHub, 2020. [Online]. Available: https://github.com/adamsolomou/SC-DNN