FPGA Based Emulation Environment For
Neuromorphic Architectures
Spencer Valancius∗ , Edward Richter∗ , Ruben Purdy∗ , Kris Rockowitz∗ , Michael Inouye∗ , Joshua Mack∗ ,
Nirmal Kumbhare∗ , Kaitlin Fair† , John Mixter‡ and Ali Akoglu∗
∗Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ, 85719 USA
{svalancius12, edwardrichter, rubenpurdy, rockowitzks, mikesinouye, jmack2545, nirmalk, akoglu}@email.arizona.edu
† Air Force Research Labs, Florida, 32542 USA
[email protected]
‡ Raytheon Missile Systems, Tucson, AZ, 85747 USA
[email protected]
arXiv:2004.06061v1 [cs.ET] 8 Apr 2020
Abstract—Neuromorphic architectures such as IBM's TrueNorth and Intel's Loihi have been introduced as platforms for energy-efficient spiking neural network execution. However, there is no framework that allows for rapidly experimenting with neuromorphic architectures and studying the trade space of hardware performance and network accuracy. Fundamentally, this creates a barrier to entry for hardware designers looking to explore neuromorphic architectures. In this paper we present an open-source FPGA-based emulation environment for neuromorphic computing research. We prototype IBM's TrueNorth architecture as a reference design and discuss the FPGA-specific design decisions made when implementing and integrating its core components. We conduct a resource utilization analysis and realize a streaming-enabled TrueNorth architecture on the Zynq UltraScale+ MPSoC. We then perform functional verification by implementing networks for the MNIST dataset and vector matrix multiplication (VMM) in our emulation environment, and present an accuracy-based comparison against the same networks generated using IBM's Compass simulation environment. We demonstrate the utility of our emulation environment for hardware designers and application engineers by altering the neuron behavior for VMM mapping, which is, to the best of our knowledge, not feasible with any other tool, including IBM's Compass environment. The proposed parameterized and configurable emulation platform serves as a basis for expanding its features to support emerging architectures, studying hypothetical neuromorphic architectures, or rapidly converging to a hardware configuration through incremental changes driven by bottlenecks as they become apparent during the application mapping process.

Keywords: Neuromorphic computing, Emulation, FPGA.

I. INTRODUCTION

Spiking neural network (SNN) architectures have been proposed with the goal of creating non-von Neumann architectures that emphasize the strengths of biologically inspired neural networks: low power, high parallelism, and fast complex computations [12], [16]. IBM's TrueNorth chip [1] and Intel's Loihi [5] are examples of such architectures for modeling leaky-integrate-and-fire neurons, with the capability of implementing multiple types of dynamic and stochastic neuron models. Neuromorphic computing architectures have a number of configuration parameters that are not inherent to the hardware, such as the number of weights a neuron can have, the bitwidth of these weights, synaptic weight memory and precision, synaptic delay, the number of neurons and axons in a core, the number of cores, network topology, and the constraints used when training networks for deployment onto the target neuromorphic architecture. There is a need for an open-source configurable emulation environment that lets hardware architects and application engineers investigate performance bottlenecks and alter the architecture accordingly, studying the impact of their design decisions on hardware performance through trend-based analysis. Such design space exploration and prototyping efforts are not feasible without an emulation environment, as these neuromorphic chips are designed as ASICs.

In this study we present a parameterized and configurable emulation platform that serves as a basis for supporting other neuromorphic architectures or investigating new architectures targeted for different application domains. We recreate and implement the TrueNorth architecture as a reference design on the Xilinx Zynq UltraScale+ MPSoC ZCU102. We validate the functionality of our emulation environment using both the MNIST dataset and vector matrix multiplication as case studies. For the MNIST dataset, we implement a network by Esser et al. [6] in our environment and compare the results of the two networks. For the case of vector matrix multiplication (VMM), we replicate the VMM mapping method of Fair et al. [8] and compare the results against similar networks generated using IBM's Compass environment [15]. We then demonstrate the architectural prototyping capabilities of our environment by introducing a single change to the neuron block component. This alteration, without accuracy degradation in either the MNIST or VMM case studies, reduces the resource requirements of the VMM networks by 50%.

Even though the TrueNorth and Loihi architectures are different in terms of packet processing, SNN mapping, core architecture, and core synchronization, our emulation environment allows tuning the aforementioned key configuration parameters to execute SNNs targeted for Loihi. The parameterized design enables the manipulation of core components without the need to recreate the entire design from scratch. This allows a user to selectively utilize TrueNorth components that interface with their own unique implementations, such as using the
Fig. 2: Neuron Block

Fig. 3: Differences between our router (a) and the IBM TrueNorth router (b): the synchronous design and reorganizing the buffers by moving them after the merge simplified the back-pressure logic and increased our ability to sustain high throughput in times of high congestion.

the memory module to read from, as well as notifies the token controller when all rows have been processed. By separating the controller from the memory module, we assist the synthesis tool in mapping the memory module to the FPGA's BRAM more efficiently. This results in implementing the core sram module with only 5.5 BRAM blocks per core, where it would have required 386 BRAM blocks otherwise.

Each neuron accounts for 386 bits of information; therefore, each emulated core requires 98,816 bits (256 ∗ 386) to store all the neuron information. However, 5.5 BRAM elements offer 202,752 bits of memory, which is much more than the 98,816 bits that we are storing for a single core sram. Despite using only around 50% of the total memory storage available in the provided 5.5 BRAM blocks, our core sram needs to be able to read all 386 bits for a single neuron in a single clock cycle. Our core sram uses five 36-Kb BRAM blocks in a 512 × 72 configuration and a single 18-Kb BRAM block in a 512 × 36 configuration to match this need.
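The sizing arithmetic above can be reproduced in a few lines (a minimal sketch; the 36-Kb and 18-Kb capacities are the standard Xilinx block RAM sizes):

```python
# Core SRAM sizing for one emulated TrueNorth core.
NEURONS_PER_CORE = 256
BITS_PER_NEURON = 386

required_bits = NEURONS_PER_CORE * BITS_PER_NEURON  # 98,816 bits per core

# Five 36-Kb BRAMs in 512x72 mode plus one 18-Kb BRAM in 512x36 mode
# ("5.5 BRAM blocks"), read side by side in a single clock cycle.
capacity_bits = 5 * 512 * 72 + 512 * 36  # 202,752 bits available
read_width = 5 * 72 + 36                 # 396 bits readable per cycle

assert read_width >= BITS_PER_NEURON   # one full neuron word per cycle
assert capacity_bits >= required_bits  # ~49% of the capacity is used
```

Note that it is the 396-bit read width, not the raw capacity, that forces 5.5 blocks: by capacity alone, 98,816 bits would fit in three 36-Kb BRAMs.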
C. Router
The router is responsible for the inter-core communication in the TrueNorth chip and enables delivering spike packets between adjacent cores, from source neurons to destination axons. Each spike in the TrueNorth network is represented as a packet, which contains the number of cores to travel horizontally (∆x) [9 bits] and vertically (∆y) [9 bits], which axon in the core to be delivered to [8 bits], and which tick to be delivered on [4 bits]. Packets travel first horizontally, and then vertically, across the two-dimensional mesh of cores until they arrive at the destination core, where they are sent to the scheduler. At the final destination, the scheduler uses the remaining bit values to determine which axon and tick instance to save the spike to.

Figure 3a shows our implementation of the router component. Each router has forward east, forward west, forward north, forward south, from local, and to local modules, which are used to communicate with both the internal modules of the core and the adjacent cores in every cardinal direction. Forward east, forward west, forward north, and forward south each have one to three FIFO buffers, which are necessary for two reasons. First, when mapping a non-trivial application to TrueNorth, the number of packets simultaneously traveling through the network on-chip can be significantly large. At times of high congestion, the buffers are necessary to achieve high throughput. Second, as packets travel through the routing network, the FIFO buffers that capture packets within each forward module become full. This is addressed by applying a form of back-pressure into the network, as discussed by Akopyan et al. [1]. Buffers enable back-pressure to ensure that packets are not lost when traveling through the network.

A significant difference between our implementation and the original router implementation in IBM's TrueNorth, shown in Figure 3b, is the placement and number of buffers. In the original implementation, each cardinal direction has a single buffer to store inputs arriving at that forwarding direction. This setup requires complicated back-pressure logic, as each buffer can send packets in multiple directions. For example, the buffers in forward east can go to forward north, forward south, or to the eastern core's forward east module. Therefore, in order to implement back-pressure, the buffer needs to receive feedback from these three possible locations. Our implementation increases the number of buffers. Rather than having a single buffer that buffers the input into the forward module, we have a buffer for each module's output. This places each buffer in between two merges (except for the two buffers in from local). The merge at the output of the buffer will
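The packet format and X-then-Y traversal described in this section can be illustrated with a short behavioral sketch. The bit ordering, the two's-complement encoding of ∆x/∆y, and the port names are our assumptions for illustration, not taken from the TrueNorth RTL:

```python
# Hypothetical 30-bit spike-packet codec: dx/dy are 9-bit signed hop
# counts, axon an 8-bit destination index, tick a 4-bit delivery slot.
def pack(dx, dy, axon, tick):
    assert -256 <= dx <= 255 and -256 <= dy <= 255
    assert 0 <= axon <= 255 and 0 <= tick <= 15
    return ((dx & 0x1FF) << 21) | ((dy & 0x1FF) << 12) | (axon << 4) | tick

def unpack(word):
    sign = lambda v: v - 512 if v & 0x100 else v  # 9-bit two's complement
    return (sign((word >> 21) & 0x1FF), sign((word >> 12) & 0x1FF),
            (word >> 4) & 0xFF, word & 0xF)

# One routing decision per hop: consume the horizontal hop count first,
# then the vertical one, then hand the packet to the local scheduler.
def route_step(dx, dy):
    if dx > 0:
        return "forward_east", dx - 1, dy
    if dx < 0:
        return "forward_west", dx + 1, dy
    if dy > 0:
        return "forward_north", dx, dy - 1
    if dy < 0:
        return "forward_south", dx, dy + 1
    return "to_local", dx, dy  # arrived: scheduler picks axon and tick
```

For example, a packet with (∆x, ∆y) = (2, -1) is forwarded east twice and then south once before being delivered to the destination core's scheduler, and `unpack(pack(2, -1, 17, 3))` recovers `(2, -1, 17, 3)`.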
Fig. 4: A synchronous scheduler design replaces 16 asynchronous control blocks with a counter.

able to achieve this state reduction by first collapsing the 256 states used in TrueNorth for the input axon evaluation into a single-state loop (state 4 in Figure 5), where the number of loop iterations is equal to the number of input axons, 256. Among the remaining thirteen states used in the TrueNorth token controller, we merge five states that deal with address updates to the scheduler and core sram components (state 1), and merge two states that deal with updating the current neuron's potential in the core sram and sending the spike information to the router (state 5). We re-purpose the spike debugging state in the original TrueNorth design to instead switch off the valid-bit signal sent to our router after we deliver a spike packet from the current neuron (state 6). For our first and last states we retain their original uses as described by TrueNorth. Lastly, we are able to remove three states from the TrueNorth design, as they correspond to states that do not impact the overall behavior of our emulation design.

F. Output Buffer

An additional component required by our emulation environment is the output buffer. In Compass [15], cores designated with output neurons send packets to an output buffer. This buffer is used to retrieve all output spike packets and send them to the user in a single tick instance, rather than the user receiving spike packets at irregular intervals. This extra buffer component adds an extra tick instance between the outputs of a TrueNorth core and the user. To ensure the output timings match up with the Compass results, we constructed a simple output buffer component that is attached to the emulated network. Output spikes are retrieved by this component, accumulated during a tick instance, and then sent to the user at the start of the next tick instance.

III. VERIFICATION AND SCALABILITY

In this section, we first present the hardware setup and FPGA resource utilization for our TrueNorth prototype. Next, we present our approach to functional verification between our FPGA prototype and IBM's TrueNorth simulator, Compass [15], by implementing a 9-bit signed vector-matrix multiplication (VMM) algorithm. We then perform functional verification by comparing results on the MNIST dataset with published TrueNorth-based implementation results [6].

A. Streaming Framework

We use the Xilinx Zynq Ultrascale+ MPSoC (XCZU9EG2FFVB1156) as the implementation platform, which consists of a Programmable System (PS) with a quad-core ARM Cortex-A53 CPU integrated with Programmable Logic (PL). The neuron parameters for the core SRAM, the neuron instructions of each core, and the spikes are first loaded into the emulator. These three files are generated offline for deploying the network on the FPGA after all constraining and training has been completed using the constrain-then-train methodology described by Esser et al. [7]. In order to move data in and out of our platform, we utilize the DMA functionality of the Zynq SoC. The emulation utilizes both the PS and PL resources of the MPSoC platform. The overall architecture of the emulation environment is composed of two ARM CPU cores on the PS side acting as the host threads, and, on the PL side, a DMA engine, a buffer, and the TrueNorth module. The host thread of the first ARM CPU core reads packets from a binary file, which can be shared using an SD card. The packets are then written to shared memory, which the PL can access using the DMA engine. The packets are fed from memory into the TrueNorth module, where they are buffered to be read at each tick. While running, the output packets are also buffered, and at the end of each tick they are written to a separate part of the shared memory. The second ARM core reads from the shared memory and writes the output packets back to the SD card.

B. Hardware Implementation Results

Table II shows the resource utilization and timing analysis on the Zynq Ultrascale+ for two network sizes. We use the single- and five-core implementations for functional verification against the VMM and MNIST reference designs, respectively. The single core occupies 0.72% of the logic resources, 0.21% of the logic-memory resources, and 0.60% of the BRAM resources. When we scale the network to five cores, resource utilization increases linearly, occupying 3.88% of logic resources, 1.06% of logic-memory resources, and 3.02% of the BRAM resources. Core computations involve 9-bit signed weight addition in the neuron block, along with 9-bit signed increment and decrement operations in the router block. Each core operates at the global tick rate of 1 kHz [1].

We show the resource utilization trend with respect to the increase in the number of cores in Table III. We sweep the resource usage space by beginning with our single-core implementation. We then expand out in the x and y directions of our 2D network grid, maintaining a square network. We observe that the network scales in a seemingly linear fashion, with the primary resource demands being on the LUT and BRAM components. We are able to create a 110 core (10x11)
TABLE III: Hardware resource usage with respect to the number of emulated TrueNorth cores. LUTs determine the scalability, reaching nearly 98% utilization at 110 cores.

NETWORK SIZE   LUT (%)   LUTRAM (%)   FF (%)   BRAM (%)
1                8.40        0.31       3.15      0.60
4                9.99        0.73       3.68      1.81
9               14.03        1.79       5.05      4.82
16              19.75        3.27       7.01      9.05
25              27.15        5.17       9.55     14.47
36              36.24        7.49      12.68     21.11
49              47.01       10.23      16.40     28.95
64              59.47       13.40      20.70     37.99
81              73.62       16.99      25.59     48.25
100             89.45       21.00      31.07     59.70
110             97.78       23.11      33.95     65.73

network, bounded by LUT utilization, as a rectangular grid on the Zynq Ultrascale+ XCZU9EG.

C. Vector Matrix Multiplication Verification

As shown by Fair et al. [8], the mapping of vector-matrix multiplication (VMM) onto TrueNorth using Compass [15], IBM's TrueNorth simulator, spreads computation across the network, core, and neuron components, and runs for hundreds of ticks. Therefore, it is an ideal application for testing our emulation environment. Furthermore, VMM has also proven to be a core building block for implementing multiple sophisticated algorithms on TrueNorth, such as the locally competitive algorithm [8], Word2Vec word similarity calculation [14], and the neural engineering framework [9]. We use the 9-bit signed VMM to verify the behavioral functionality of our emulation prototype against its implementation in Compass. We created 100 random matrices ranging from 2x3 to 8x8 and a random vector of 9-bit signed integers of the appropriate size for each matrix. Each vector-matrix pair was mapped to Compass [15] using the method proposed by Fair et al. [8]. We then mapped the same matrix-vector pairs to our FPGA emulation and found a one-to-one match between the two.

D. Modifying Neuron Behaviour for Efficient VMM Mapping

Mapping signed VMM requires representation of positive and negative values. As the rate-encoded input spikes lack sign, to make signed VMM mappable to TrueNorth, Fair et al. [8] duplicate the axons in a core, dividing them into positive and negative groups, where positive and negative input spikes are routed to their respective groups. Similarly, the neurons are duplicated and divided, allowing them to represent positive and negative outputs from the respective connected axons. Neuron block operations proceed as previously described, facilitating simultaneous operation on positive and negative values without prior knowledge of sign.

While the positive threshold is evaluated using ≥, the negative threshold uses the < operator, as illustrated in Figure 2. Uncorrected, this asymmetry allows the potential of a neuron to remain negative when it should otherwise be reset to zero, thus producing an incorrect number of output spikes. This is depicted in Table IV. Despite the neuron and axon duplication that allows signed VMM, correct output can only be achieved by implementing a feedback system that reroutes spikes back to the core and drives the negative neuron potential back to zero [8]. This feedback system requires an additional doubling of the number of neurons.

The scalability of VMM mapping to TrueNorth is limited by this asymmetry of the neuron potential reset thresholds. The duplication of the number of axons and neurons severely limits scalability, quickly exhausts resources on TrueNorth for larger VMM problems, and requires a cluster of TrueNorth chips to map convolution, locally competitive algorithm, least squares minimization, or support vector machine training [8] types of applications. Resorting to a resource-replication type of workaround to accommodate signed multiplication is inevitable when restricted by the fixed architecture. We identify this problem as a key case study for demonstrating the utility of our FPGA-based emulation environment, where an application engineer has the ability to change the hardware behavior and eliminate the need for resource duplication completely. In our emulation environment, the imbalance of equality operators is quickly resolved by modifying the negative threshold behavior such that it uses a "≤" comparison rather than a "<" comparison. Despite the simplicity of this change in hardware, it is infeasible within Compass due to the fixed nature of the TrueNorth architecture. The proposed symmetric-threshold-based hardware modification eliminates the need for the feedback system, and it enables resource reduction, which we discuss next.

To demonstrate the results, we map an 8 × 8 matrix, the largest which fits a single 256 × 256 core with feedback. Each column requires 8 neurons for the positive representation, 8 neurons for the negative representation, and 16 neurons for feedback. This is 32 neurons per column, multiplied by 8 columns, yielding 256 neurons. The 16 feedback neurons multiplied by 8 columns yield 128 feedback neurons, which necessitate 128 axons by which to connect. The maximum vector of 1 × 8 requires 1 axon per column for the positive and 1 axon per column for the negative representation. Duplicating this number ensures correspondence between signed inputs and signed matrix values, resulting in 4 axons per column and a total of 32 axons. The 128 feedback axons, required by the 8 × 8 matrix, are added to the 32 input axons, creating a 160-axon, 256-neuron core. Eliminating the feedback system leaves behind a 32-axon, 128-neuron core to solve the same VMM problem and reduces the number of neurons by 50%.

In order to validate our resource reduction analysis, we implement the signed VMM mapping method of Fair et al. [8] for the 1 × 8 vector and 8 × 8 matrix on the reference architecture that is emulated using the Zynq Ultrascale+ MPSoC. We then implement the same VMM problem on the proposed architecture that supports the symmetric threshold and eliminates the feedback loop. We show the resource usage of the functionally equivalent VMM mappings on the reference and proposed architectures in Table V. Elimination of the feedback loop removes half the dimensions of a standard core, which in turn reduces the necessary bit allotment for the scheduler and core sram. As shown in Table III, the standard 256x256
TABLE IV: The table depicts the tick-by-tick interaction between incoming spikes, connection weights, neuron potentials, and output spikes depending upon axon type. We see that the symmetry of the reset thresholds affects the state of the neuron potential after successive ticks, with the asymmetric potential remaining negative until a positive value drives it back toward zero. In applications like VMM, the positive (+) and negative (-) representations rely on identical behavior of the positive and negative potential resets to allow simultaneous positive and negative values to be calculated and represented by the neurons. Due to the configurability of our emulation environment, the feedback system used to correct the asymmetry is easily discarded.

                                |              Asymmetric               |               Symmetric
Tick  Axon  Spike  Weight(+, -) | Pot.(+)  Pot.(-)  Output(+)  Output(-) | Pot.(+)  Pot.(-)  Output(+)  Output(-)
1     0     1      1, -1        |   1        -1        1          0      |   1        1         1          0
2     X     0      X            |   0        -1        0          0      |   0        0         0          0
3     1     1      -1, 1        |  -1         0        0          0      |  -1        1         0          1
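The effect of the comparison operator alone can be sketched with a minimal model (our simplification: a reset-to-zero negative threshold of -1 and a positive threshold of +1; the exact TrueNorth reset modes are richer than this):

```python
import operator

# Integrate one weight, then apply positive/negative threshold checks.
# `neg_cmp` is operator.lt for the TrueNorth-style asymmetric reset
# ("<") and operator.le for the proposed symmetric reset ("<=").
def tick(potential, weight, neg_cmp, pos_th=1, neg_th=-1):
    potential += weight
    if potential >= pos_th:         # positive threshold: spike and reset
        return 0, 1
    if neg_cmp(potential, neg_th):  # negative threshold check
        return 0, 0                 # reset the potential to zero
    return potential, 0

# A weight of -1 drives the potential to exactly -1:
assert tick(0, -1, operator.lt) == (-1, 0)  # "<": stuck at -1, no reset
assert tick(0, -1, operator.le) == (0, 0)   # "<=": reset back to zero
```

With the asymmetric "<" comparison the potential remains at -1 across subsequent ticks, exactly the lingering negative state the feedback system exists to correct; the "≤" comparison restores it to zero immediately.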
core requires 5.5 BRAMs, which the core sram occupies. For the 8 × 8 VMM problem based on the reference design with the feedback loop, we note that the smaller dimensions of 160 × 256 require 4 BRAMs. For the proposed symmetric-threshold-based architecture, the 32 × 128 core requires 2 BRAMs, which confirms the expected 50% reduction. Instead of an 8-bit counter for the scheduler that counts up to 256, the proposed design requires a 5-bit counter to index each of the 32 axons. This enables more efficient mapping of the scheduler to the LUT-RAMs, reducing its total utilization by 75%. Additionally, removing the feedback system not only permits double the previous matrix capacity to occupy a single core, but also reduces the critical path delay by 21.8%, further bolstering throughput and scalability.

E. MNIST Verification

Fig. 6: Five-core network implementation with four 16x16 windows with a stride of 12, represented by blocks a-d. These generate a 256-input fan-in for the input layer of cores in our network. The input layer only uses 64 of the 256 possible neurons and outputs those to the classification core in the next layer.

In this experiment we implement a five-core network that replicates the design introduced by Esser et al. [6], illustrated in Figure 6, while using the training methodology proposed by Yepes et al. [13]. Due to the MNIST data set images having a size of 28x28 pixels, to ensure that our input layer of cores is fully connected, we split the images into four sections of 16x16 windows, with each window separated using a stride of 12 pixels. Each core of our input layer only uses 64 out of the 256 total neurons available to it, as each input core represents one-quarter of the image. The image splitting method ensures each quarter is weighted evenly within the classification core. The classification core then uses only 250 out of the total 256 neurons to evenly distribute 25 neurons per class for each of the ten classes within the MNIST data set.

Fig. 7: For the MNIST data set we modify our Output Core to output all class votes as they are accumulated. For the first six ticks of the data set, we generate the resulting votes in the above histogram. For instances where a tie occurs, the Output Core is set to select the first instance.

When running our implementation, the output core generates all votes for each class in MNIST as they are accumulated. This allows us to produce a histogram similar to Figure 7, which shows the number of votes for each digit across multiple ticks. By comparing against Compass, we verified that these histograms matched and the correct digit was being selected. Our five-core implementation achieves an accuracy of 96.28% on the MNIST data set, which is comparable to the accuracy achieved by Yepes et al. [13]. Our emulation environment takes 10 seconds to fully infer the 10,000 testing images of MNIST. An in-house serial implementation of the same emulation environment takes around 2 hours on an Intel Xeon processor (3 GHz, 32 GB RAM) to fully infer the dataset. We find that even with the symmetric threshold, our accuracy is unchanged.

IV. CONCLUSION

In this paper we present our approach to implementing an FPGA-based neuromorphic architecture emulation platform. We use IBM's TrueNorth as a reference and discuss our
hardware design decisions for each architectural component to make it feasible to implement on the FPGA. We conduct a hardware resource usage analysis, validate the functionality of our emulation environment, and demonstrate its utility through case studies based on comparisons with respect to the published results. To the best of our knowledge this is the first academic work on an FPGA-based emulation environment for simulating clusters of leaky-integrate-and-fire neuron models integrated with the principal router, scheduler, and memory management components. Unlike other approaches (e.g., [4], [10], [11], [17]) that are presented towards achieving large-scale spiking neural network simulations, the proposed open-source, parameterized, and modular emulation environment serves as a basis to conduct hardware architecture research for neuromorphic computing and investigate the trade space between mapping strategies, hardware performance, and accuracy for the target applications.

Our FPGA-based emulation environment replaces the "globally synchronous-locally asynchronous" design with a fully synchronous design, as we focus on designing functionally correct synaptic cores and a basic leaky-integrate-and-fire neuron model. This allowed us to rapidly manipulate core components without needing to continually reconfigure our FPGA place-and-route tool chains to meet asynchronous timing requirements, and to investigate applications which have difficulty being mapped due to the architectural constraints, as we demonstrated with the case study on VMM requiring neuron copies to correctly function.

We believe there is room for reducing the resource usage for a more scalable emulation platform. We plan to optimize the BRAM usage by replacing the method of reading all 386 bits in a single clock cycle with a design that reads from the core sram in 72-bit bursts over multiple clock cycles, aligned with the 512x72 BRAM configuration. As future work we plan to build on our resource-efficient way of mapping the VMM and implement applications such as sparse matrix approximation and convolution. The ability to process convolution will in turn allow us to target a much broader class of image recognition tasks, such as Synthetic Aperture Radar (SAR) classification, dealing with more complex images compared to MNIST. Additionally, we will investigate the model accuracy challenges of a neuromorphic system while maintaining its energy-efficient execution flow by studying the correlation between training methods, accuracy, and architecture configuration parameters.

V. ACKNOWLEDGEMENTS

Research reported in this publication was supported in part by Raytheon Missile Systems under the contract 2017-UNI-0008. The content is solely the responsibility of the authors and does not necessarily represent the official views of Raytheon Missile Systems.

REFERENCES

[1] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. J. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, Oct 2015.
[2] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha. Building block of a programmable neuromorphic substrate: A digital neurosynaptic core. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8, June 2012.
[3] A. S. Cassidy, P. Merolla, J. V. Arthur, S. K. Esser, B. Jackson, R. Alvarez-Icaza, P. Datta, J. Sawada, T. M. Wong, V. Feldman, A. Amir, D. B. D. Rubin, F. Akopyan, E. McQuinn, W. P. Risk, and D. S. Modha. Cognitive computing building block: A versatile and efficient digital neuron model for neurosynaptic cores. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–10, Aug 2013.
[4] T. Chou, H. J. Kashyap, J. Xing, S. Listopad, E. L. Rounds, M. Beyeler, N. Dutt, and J. L. Krichmar. CARLsim 4: An open source library for large scale, biologically detailed spiking neural network simulation using heterogeneous clusters. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2018.
[5] M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1), Jan. 2018.
[6] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha. Backpropagation for energy-efficient neuromorphic computing. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1117–1125. Curran Associates, Inc., 2015.
[7] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446, 2016.
[8] K. L. Fair, D. R. Mendat, A. G. Andreou, C. J. Rozell, J. Romberg, and D. V. Anderson. Sparse coding using the locally competitive algorithm on the TrueNorth neurosynaptic system. Frontiers in Neuroscience, 13:754, 2019.
[9] K. D. Fischl, A. G. Andreou, T. C. Stewart, and K. Fair. Implementation of the neural engineering framework on the TrueNorth neurosynaptic system. In IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 1–4, Oct 2018.
[10] M. L. Hines and N. T. Carnevale. The NEURON simulation environment. Neural Computation, 9(6):1179–1209, 1997.
[11] R. Hoang, D. Tanna, L. Jayet Bray, S. Dascalu, and F. Harris. A novel CPU/GPU simulation environment for large-scale biologically realistic neural modeling. Frontiers in Neuroinformatics, 7:19, 2013.
[12] N. Imam, K. Wecker, J. Tse, R. Karmazin, and R. Manohar. Neural spiking dynamics in asynchronous digital circuits. In 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8, Aug 2013.
[13] A. Jimeno-Yepes, J. Tang, and B. S. Mashford. Improving classification accuracy of feedforward neural networks for spiking neuromorphic chips. In IJCAI, 2017.
[14] D. R. Mendat, A. S. Cassidy, G. Zarrella, and A. G. Andreou. Word2vec word similarities on IBM's TrueNorth neurosynaptic system. In Biomedical Circuits and Systems Conference (BioCAS), pages 1–4, Oct 2018.
[15] R. Preissl, T. M. Wong, P. Datta, M. Flickner, R. Singh, S. K. Esser, W. P. Risk, H. D. Simon, and D. S. Modha. Compass: A scalable simulator for an architecture for cognitive computing. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11, Nov 2012.
[16] S. Schmitt, J. Klähn, G. Bellec, A. Grübl, M. Güttler, A. Hartel, S. Hartmann, D. Husmann, K. Husmann, S. Jeltsch, V. Karasenko, M. Kleider, C. Koke, A. Kononov, C. Mauch, E. Müller, P. Müller, J. Partzsch, M. A. Petrovici, S. Schiefer, S. Scholze, V. Thanasoulis, B. Vogginger, R. Legenstein, W. Maass, C. Mayr, R. Schüffny, J. Schemmel, and K. Meier. Neuromorphic hardware in the loop: Training a deep spiking network on the BrainScaleS wafer-scale system. In 2017 Int. Joint Conference on Neural Networks (IJCNN), pages 2227–2234, May 2017.
[17] E. Yavuz, J. Turner, and T. Nowotny. GeNN: a code generation framework for accelerated brain simulations. In Scientific Reports, volume 6, January 2016.