
FPGA Based Emulation Environment for Neuromorphic Architectures
Spencer Valancius∗, Edward Richter∗, Ruben Purdy∗, Kris Rockowitz∗, Michael Inouye∗, Joshua Mack∗,
Nirmal Kumbhare∗, Kaitlin Fair†, John Mixter‡ and Ali Akoglu∗
∗Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ, 85719 USA
{svalancius12, edwardrichter, rubenpurdy, rockowitzks, mikesinouye, jmack2545, nirmalk, akoglu}@email.arizona.edu
†Air Force Research Labs, Florida, 32542 USA
[email protected]
‡Raytheon Missile Systems, Tucson, AZ, 85747 USA
[email protected]
arXiv:2004.06061v1 [cs.ET] 8 Apr 2020

Abstract—Neuromorphic architectures such as IBM's TrueNorth and Intel's Loihi have been introduced as platforms for energy efficient spiking neural network execution. However, there is no framework that allows for rapidly experimenting with neuromorphic architectures and studying the trade space between hardware performance and network accuracy. Fundamentally, this creates a barrier to entry for hardware designers looking to explore neuromorphic architectures. In this paper we present an open-source FPGA based emulation environment for neuromorphic computing research. We prototype IBM's TrueNorth architecture as a reference design and discuss FPGA specific design decisions made when implementing and integrating its core components. We conduct resource utilization analysis and realize a streaming-enabled TrueNorth architecture on the Zynq UltraScale+ MPSoC. We then perform functional verification by implementing networks for the MNIST dataset and vector matrix multiplication (VMM) in our emulation environment and present an accuracy-based comparison against the same networks generated using IBM's Compass simulation environment. We demonstrate the utility of our emulation environment for hardware designers and application engineers by altering the neuron behavior for VMM mapping, which is, to the best of our knowledge, not feasible with any other tool including IBM's Compass environment. The proposed parameterized and configurable emulation platform serves as a basis for expanding its features to support emerging architectures, studying hypothetical neuromorphic architectures, or rapidly converging to a hardware configuration through incremental changes based on bottlenecks as they become apparent during the application mapping process.

Keywords: Neuromorphic computing, Emulation, FPGA.

I. INTRODUCTION

Spiking neural network (SNN) architectures have been proposed with the goal of creating non-von Neumann architectures that emphasize the strengths of biologically inspired neural networks: low power, high parallelism, and fast complex computations [12], [16]. IBM's TrueNorth chip [1] and Intel's Loihi [5] are examples of such architectures for modeling leaky-integrate-and-fire neurons with the capability of implementing multiple types of dynamic and stochastic neuron models. Neuromorphic computing architectures have a number of configuration parameters that are not inherent to the hardware, such as the number of weights a neuron can have, the bitwidth of these weights, synaptic weight memory and precision, synaptic delay, the number of neurons and axons in a core, the number of cores, the neuron count per core, the network topology, and the constraints used when training networks for deployment onto the target neuromorphic architecture. There is a need for an open-source configurable emulation environment for hardware architects and application engineers to investigate performance bottlenecks and accordingly alter the architecture by investigating the impact of their design decisions on hardware performance through trend based analysis. Such design space exploration and prototyping efforts are not feasible without an emulation environment, as these neuromorphic chips are designed as ASICs.

In this study we present a parameterized and configurable emulation platform that serves as a basis for supporting other neuromorphic architectures or investigating new architectures targeted for different application domains. We recreate and implement the TrueNorth architecture as a reference design on the Xilinx Zynq UltraScale+ MPSoC ZCU102. We validate the functionality of our emulation environment using both the MNIST dataset and vector matrix multiplication as case studies. For the MNIST dataset we implement a network by Esser et al. [6] in our environment and compare the results of the two networks. For the case of vector matrix multiplication (VMM), we replicate the VMM mapping method of Fair et al. [8] and compare the results against similar networks generated using IBM's Compass environment [15]. We then demonstrate the architectural prototyping capabilities of our environment by introducing a single change to the neuron block component. This alteration, without accuracy degradation to either the MNIST or VMM case studies, reduces the resource requirements of the VMM networks by 50%.

Even though the TrueNorth and Loihi architectures are different in terms of packet processing, SNN mapping, core architecture, and core synchronization, our emulation environment allows tuning the aforementioned key configuration parameters to execute SNNs targeted for Loihi. The parameterized design enables the manipulation of core components without the need to recreate the entire design from scratch. This allows a user to selectively utilize TrueNorth components that interface with their own unique implementations, such as using the
TrueNorth's crossbar but changing out the routing network for some other interconnect architecture. The modularity of our emulation environment allows for incremental changes to implement features such as on-chip learning, core-to-core multicast, core management packets, and the configurable synaptic weight precision (signed and unsigned) offered by Loihi.

The remainder of this paper is organized as follows: In Section II we introduce each individual component that makes up a single TrueNorth core and describe our key FPGA implementation methods. In Section III, we present the hardware resource requirements of the proposed emulation environment, analyze its scalability on the FPGA, and discuss our validation approach. Finally, in Section IV we present our conclusions and planned future work.

II. REFERENCE ARCHITECTURE OVERVIEW AND IMPLEMENTATION

A single TrueNorth chip is comprised of 4,096 neurosynaptic cores. Each core is comprised of five components [1] as shown in Figure 1. In this section we summarize the functionality of each component and describe our FPGA based implementation approach.

Fig. 1: TrueNorth core comprised of five components: neuron block, core sram, router, scheduler, and token controller.

A. Neuron Block

The neuron block is the primary computational component of our reference architecture. The purpose of the neuron block is to compute a running sum value known as the neuron potential, which is the sum of the weights associated with the input axons, and accumulate it with the existing potential value. Each of the individual neurons within a core is sequentially evaluated. During a neuron's evaluation, all input axons are checked for a binary spike as well as a synaptic connection to the current neuron. If both the binary spike on the axon and a synaptic connection between the neuron and the axon exist, then the corresponding synaptic weight is added to the neuron's potential. Once all input axons have been checked, a leak value is applied to the potential. This potential is then evaluated against an asymmetrical threshold value. If the potential is greater than or equal to a positive threshold, then the neuron block produces a spike, and the neuron potential is reset to a reset-potential value. If the potential is less than a negative threshold, then the neuron block resets the neuron potential to a reset-potential value, but does not produce a spike. If neither of these cases occurs, the neuron's potential is saved and used as the start of the new running sum during the next cycle [1]–[3], [12].

Fig. 2: Neuron Block

For our emulation we produce a neuron block model as shown in Figure 2. When a new neuron is loaded into the neuron block, the four synaptic weights are loaded into the multiplexer unit on the far left of the image. The synaptic weight is selected based on the current input axon's synaptic weight index value. This index value is held constant for a given axon. For example, if input axon 0 has a synaptic weight index value of 3, that index value will be used to select the same index of synaptic weight from the mux for all neurons. If, for a given input axon, there exists a binary spike as well as a synaptic connection, then the next mux will allow the selected synaptic weight through, and it will be added to the current neuron potential. If either of these conditions is not met, then this mux will send the value of 0 through and no change will be made to the neuron's potential.

The updated neuron potential is saved in the register to be used on the next clock cycle for the next axon. A leak value λ is constantly incorporated into the updated neuron potential to allow for the modification of the potential value outside of spikes. The leak-modified neuron potential is then compared against the positive and negative thresholds to determine if a spike is to be processed, as well as which reset value, if any, to utilize. Once the current neuron has been evaluated, a new neuron's potential is used for the running sum, rather than continuing with the old neuron's value.

B. Core SRAM

The core sram is the main memory for a TrueNorth core. Each core sram is a matrix of 256 rows by X columns, where X is the total number of bits required to fully encode a single neuron with all of its respective parameters. Our core sram model closely follows the one described by Akopyan et al. [1]. In our implementation X is equal to 386 bits and is broken down as shown in Table I. In a core sram, each of the 256 rows represents one of the 256 neurons contained within a single TrueNorth core.

TABLE I: Core SRAM Parameter Breakdown.

Parameter Name                     Bit Width
Synaptic Connections               256 bits
Potential and Neuron Parameters    100 bits
Spike Destination                  26 bits
Delivery Tick                      4 bits

Potential & Neuron Parameters      Bit Width
Current Potential                  9 bits
Reset Potential                    9 bits
Weights 0, 1, 2, and 3             9 bits each
Leak Value                         9 bits
Positive/Negative Threshold        18 bits each
Reset Mode                         1 bit

For our emulation's core sram component, we implement it as two different modules: controller and memory. The controller determines which row, or rather which neuron, in the memory module to read from, and notifies the token controller when all rows have been processed. By separating the controller from the memory module, we assist the synthesis tool in mapping the memory module to the FPGA's BRAM more efficiently. This results in implementing the core sram module with only 5.5 BRAM blocks per core, whereas 386 BRAM blocks would have been required otherwise.

Each neuron accounts for 386 bits of information; therefore each emulated core requires 98,816 bits (256 ∗ 386) to store all the neuron information. However, 5.5 BRAM elements offer 202,752 bits of memory, which is much more than the 98,816 bits that we are storing for a single core sram. Despite using only around 50% of the total memory storage available in the provided 5.5 BRAM blocks, our core sram needs to be able to read all 386 bits for a single neuron in a single clock cycle. Our core sram uses five 36-Kb BRAM blocks in a 512 × 72 configuration and a single 18-Kb BRAM block in a 512 × 36 configuration to match this need.

C. Router

The router is responsible for the inter-core communication in the TrueNorth chip and enables delivering spike packets between adjacent cores from source neurons to destination axons. Each spike in the TrueNorth network is represented as a packet, which contains information regarding the number of cores to travel horizontally (∆x) [9 bits] and vertically (∆y) [9 bits], which axon in the core to be delivered to [8 bits], and which tick to be delivered on [4 bits]. Packets travel first horizontally, and then vertically, across the two-dimensional mesh of cores until they arrive at the destination core, where they are sent to the scheduler. At the final destination, the scheduler uses the remaining bit values to determine which axon and tick instance to save the spike to.

Fig. 3: Differences between our router (a) and the IBM TrueNorth router (b): a synchronous design and reorganizing buffers by moving them after the merge simplified back-pressure logic and increased our ability to maintain high throughput in times of high congestion.

Figure 3a shows our implementation of the router component. Each router has forward east, forward west, forward north, forward south, from local, and to local modules, which are used to communicate with both the internal modules of the core and the adjacent cores in every cardinal direction. Forward east, forward west, forward north, and forward south each have one to three FIFO buffers, which are necessary for two reasons. First, when mapping a non-trivial application to TrueNorth, the number of packets simultaneously traveling through the on-chip network can be significantly large. At times of high congestion, the buffers are necessary to achieve high throughput. Second, as packets travel through the routing network, the FIFO buffers that capture packets within each forward module become full. This is addressed by applying a form of back-pressure into the network, as discussed by Akopyan et al. [1]. Buffers enable back-pressure to ensure that packets are not lost when traveling through the network.

A significant difference between our implementation and the original router implementation in IBM's TrueNorth, shown in Figure 3b, is the placement and number of buffers. In the original implementation, each cardinal direction has a single buffer to store inputs arriving at that forwarding direction. This setup requires complicated backpressure logic, as each buffer can send packets in multiple directions. For example, the buffers in forward east can go to forward north, forward south, or to the eastern core's forward east module. Therefore, in order to implement backpressure, the buffer needs to receive feedback from these three possible locations. Our implementation increases the number of buffers. Rather than having a single buffer that buffers the input into the forward module, we have a buffer for each module's output. This places each buffer in-between two merges (except for the two buffers in from local). The merge at the output of the buffer will
send a read_enable signal to the buffer when the buffer is not empty and the buffers at the output of the merge are not full. Each buffer sends a buffer_full signal to the merge at its input to ensure that new data will not arrive when the buffer is full. This reduces the logic required to implement the router's backpressure. Additionally, by buffering the outputs of a forward module instead of buffering the inputs, we obtain a throughput increase of 3x for horizontal forwarding modules and 2x for vertical forwarding modules.

D. Scheduler

The scheduler is the final stop for a spike packet that is traversing the cores through the routing network. By the time a spike packet arrives at its destination core, it is reduced to 12 bits with 8 bits of axon and 4 bits of tick offset values. The scheduler contains a 256 row by 16 column SRAM. The 8 bit axon value of the spike packet is used to determine which of the 256 rows the newly arrived spike will be written on. The 4 bit tick offset is used to determine which column, with respect to the currently active column, the spike is written to. In the event that the tick offset causes the spike to be written to the active column, or the current tick, an error is thrown and the packet is dropped. This error does not cause any part of the TrueNorth core to halt. It is used to alert the user that information may no longer be accurate [1].

The scheduler in the reference design is comprised of SRAM memory and control blocks. The SRAM memory is used to store the spike packets that arrive from the core's router. The control block determines which spikes to process in a given tick instance. The scheduler contains sixteen of these control blocks corresponding to each of the 16 columns (16 tick instances), in which only one of these control blocks is active at a time by means of a passing token. The active control block reads the corresponding SRAM column and sends the data and the input axon spike information to the token controller. It is also responsible for then clearing the column once the token controller has completed its full FSM circuit and is back to waiting for the next tick.

Fig. 4: A synchronous scheduler design replaces 16 asynchronous control blocks with a counter.

In our emulation design, we propose replacing the control logic with a 4-bit counter that increments every time it receives the tick, as illustrated in Figure 4. As the counter updates based on a signal received from the token controller, the counter value is sent into the look-up table (LUT) memory blocks that comprise the scheduler SRAM. We purposefully use LUT memory for the scheduler over using additional BRAM blocks so that we may use the BRAM blocks wholly for the core SRAM components. As with the original scheduler design in TrueNorth, our emulation scheduler does not allow spike packets to be written to an SRAM column corresponding to the current tick instance. In TrueNorth's scheduler, if this occurs the packet is dropped and an error flag is sent to the token controller, which causes another flag to be raised to alert the user that information may no longer be valid. For our own emulation design we take this error flag and bypass the token controller, sending it out to the user. This allows us to distinguish between an error that has occurred within the scheduler and an error within the token controller. Being able to distinguish between these two errors is a key item for our emulation environment, as new architecture prototyping may cause a distinct error in only one of these locations. Knowing which location can assist researchers in the debugging process.

E. Token Controller

The token controller maintains the global synchronous behavior of the TrueNorth core, as well as the intercommunication between the other four components in a single core. The token controller is a 269-state asynchronous FSM in the original design. 256 states are used to evaluate the individual input axons to determine if a binary spike as well as a connection to the current neuron exists. In the event this is true, the same state sends the appropriate information to the neuron block and uses an asynchronous communication tree, referred to as the request/acknowledge tree, to meet timing constraints. If an axon does not have either a binary spike or a synaptic connection with the current neuron, then it moves on to the next state in the FSM. The remaining 13 states reference the active neuron's information from the core sram for the purpose of setting the current tick instance within the scheduler and to retrieve the input axon binary spikes from the scheduler sram. A spike packet is sent to the router if the core determines the neuron block has met its defined threshold. The controller FSM then checks if all neurons have been evaluated.

Fig. 5: Token controller state machine.

In our design we take advantage of our fully synchronous design to reduce the number of states from 269 to 8, as shown in Figure 5. We achieve this state reduction by first collapsing the 256 states used in TrueNorth for the input axon evaluation into a single state loop (state 4 in Figure 5), where the number of loop iterations is equal to the number of input axons, 256. Among the remaining thirteen states used in the TrueNorth token controller, we merge five states that deal with address updating to the scheduler and core sram components (state 1), and merge two states that deal with updating the current neuron's potential into the core sram and sending the spike information to the router (state 5). We re-purpose the spike debugging state in the original TrueNorth design to instead switch off the valid bit signal sent to our router after we deliver a spike packet from the current neuron (state 6). For our first and last states we retain their original uses as described by TrueNorth. Lastly, we are able to remove three states from the TrueNorth design, as they correspond to states that do not impact the overall behavior of our emulation design.

F. Output Buffer

An additional component required by our emulation environment is the output buffer. In Compass [15], cores designated with output neurons send packets to an output buffer. This buffer is used to retrieve all output spike packets and send them to the user in a single tick instance, rather than the user receiving spike packets at irregular intervals. This extra buffer component adds an extra tick instance between the outputs of a TrueNorth core and the user. To ensure the output timings match up with the Compass results, we constructed a simple output buffer component that is attached to the emulated network. Output spikes are retrieved by this component, accumulated during a tick instance, and then sent to the user at the start of the next tick instance.

III. VERIFICATION AND SCALABILITY

In this section, we first present the hardware setup and FPGA resource utilization for our TrueNorth prototype. Next, we present our approach of functional verification between our FPGA prototype and IBM's TrueNorth simulator, Compass [15], by implementing a 9-bit signed vector-matrix multiplication (VMM) algorithm. We then perform functional verification by comparing results on the MNIST dataset with published TrueNorth based implementation results [6].

A. Streaming Framework

We use the Xilinx Zynq Ultrascale+ MPSoC (XCZU9EG2FFVB1156) as the implementation platform, which consists of a Programmable System (PS) with a quad-core ARM Cortex-A53 CPU integrated with Programmable Logic (PL). The neuron parameters for the core SRAM, the neuron instructions of each core, and the spikes are first loaded into the emulator. These three files are generated offline for deploying the network on the FPGA after all constraining and training has been completed using the constrain-then-train methodology described by Esser et al. [7]. In order to move data in and out of our platform, we utilize the DMA functionality of the Zynq SoC. The emulation utilizes both the PS and PL resources of the MPSoC platform. The overall architecture of the emulation environment is composed of two ARM CPU cores on the PS side acting as the host threads, and on the PL side, a DMA engine, buffer, and the TrueNorth module. The host thread of the first ARM CPU core reads packets from a binary file, which can be shared using an SD card. The packets are then written to shared memory, which the PL can access using the DMA engine. The packets are fed from memory into the TrueNorth module, where they are buffered to be read at each tick. While running, the output packets are also buffered, and at the end of each tick they are written to a separate part of the shared memory. The second ARM core reads from the shared memory and writes the output packets back to the SD card.

B. Hardware Implementation Results

TABLE II: Post implementation resource usage by component for 1 and 5 core networks on the Xilinx Zynq Ultrascale+ XCZU9EG FPGA, based on Look Up Table logic (LUTs), Look Up Table memory (LUT-RAM), Flip-Flops (FFs), and Block Random Access Memory (BRAM).

Component         LUTs            LUT-RAM       FFs            BRAM          Critical Delay
Network Size      1       5      1      5      1      5       1      5      1
Core SRAM         91      455    0      0      0      0       5.5    27.5   3.338 ns
Neuron Block      39      195    0      0      9      45      0      0      1.670 ns
Token Controller  46      230    0      0      32     160     0      0      5.321 ns
Scheduler         376     1880   304    1520   4      20      0      0      4.647 ns
Router            1418    7862   0      0      1167   2696    0      0      11.551 ns
Total Available   274080         144000        548160         912

Table II shows the resource utilization and timing analysis on the Zynq Ultrascale+ for two network sizes. We use the single and five core implementations for functional verification against the VMM and MNIST reference designs, respectively. The single core occupies 0.72% of the logic resources, 0.21% of the logic-memory resources, and 0.60% of the BRAM resources. When we scale the network to five cores, resource utilization increases linearly, occupying 3.88% of logic resources, 1.06% of logic-memory resources, and 3.02% of the BRAM resources. Core computations involve 9-bit signed weight addition in the neuron block, along with 9-bit signed increment and decrement operations in the router block. Each core operates at a global tick rate of 1 KHz [1].

We show the resource utilization trend with respect to the increase in the number of cores in Table III. We sweep the resource usage space by beginning with our single core implementation. We then expand out in the x and y directions of our 2D network grid, maintaining a square network. We observe that the network scales in a seemingly linear fashion, with the primary resource demands being on the LUT and BRAM components. We are able to create a 110 core (10x11)
TABLE III: Hardware resource usage with respect to the by implementing a feedback system that reroutes spikes back
number of emulated TrueNorth cores. LUTs determine the to the core and drives the negative neuron potential back to
scalability, reaching nearly 98% utilization at 110 cores. zero [8]. This feedback system requires an additional doubling
NETWORK LUT LUTRAM FF BRAM of the number of neurons.
SIZE (%) (%) (%) (%) The scalability of VMM mapping to TrueNorth is limited
1 8.40 0.31 3.15 0.60
4 9.99 0.73 3.68 1.81
by this asymmetry of neuron potential reset thresholds. The
9 14.03 1.79 5.05 4.82 duplication of the number of axons and neurons severely
16 19.75 3.27 7.01 9.05 limits scalability, quickly exhausts resources on TrueNorth for
25 27.15 5.17 9.55 14.47
36 36.24 7.49 12.68 21.11
larger VMM problems and requires a cluster of TrueNorth
49 47.01 10.23 16.40 28.95 chips to map convolution, locally competitive algorithm, least
64 59.47 13.40 20.70 37.99 squares minimization, or support vector machine training [8]
81 73.62 16.99 25.59 48.25
100 89.45 21.00 31.07 59.70
types of applications. Resorting to a resource replication
110 97.78 23.11 33.95 65.73 type of workaround to accommodate signed multiplication
is inevitable when restricted by the fixed architecture. We
network, bounded by LUT utilization, as a rectangular grid on identify this problem as a key case study for demonstrating
the Zynq Ultrascale+ XCZU9EG. the utility of our FPGA based emulation environment, where
an application engineer has the ability to change the hardware
C. Vector Matrix Multiplication Verification behavior and eliminate the need for resource duplication
As shown by Fair et al. [8], the mapping of vector-matrix completely. In our emulation environment, the imbalance
multiplication (VMM) onto TrueNorth using Compass [15], of equality operators is quickly resolved by modifying the
IBM’s TrueNorth simulator, spreads computation across net- negative threshold behavior such that it uses a "≤" comparison
work, core and neuron components, and runs for hundreds rather than a "<" comparison. Despite the simplicity of this
of ticks. Therefore, it is an ideal application for testing our change in hardware, it is infeasible within Compass due to
emulation environment. Furthermore, VMM has also proven the fixed-nature of the TrueNorth architecture. The proposed
to be a core building block in implementing multiple sophisti- symmetric threshold based hardware modification eliminates
cated algorithms on TrueNorth such as the locally-competitive the need for the feedback system, and it enables resource
algorithm [8], Word2Vec word similarity calculation [14], and the neural engineering framework [9]. We use the 9-bit signed VMM to verify the behavioral functionality of our emulation prototype against its implementation in Compass. We created 100 random matrices ranging from 2×3 to 8×8 and a random vector of 9-bit signed integers of the appropriate size for each matrix. Each vector-matrix pair was mapped to Compass [15] using the method proposed by Fair [8]. We then mapped the same matrix-vector pairs to our FPGA emulation and found a one-to-one match between the two.

D. Modifying Neuron Behavior for Efficient VMM Mapping

Mapping signed VMM requires representing both positive and negative values. Since the rate-encoded input spikes lack sign, Fair et al. [8] make signed VMM mappable to TrueNorth by duplicating the axons in a core and dividing them into positive and negative groups, where positive and negative input spikes are routed to their respective groups. Similarly, the neurons are duplicated and divided, allowing them to represent positive and negative outputs from their connected axons. Neuron block operations proceed as previously described, facilitating simultaneous operation on positive and negative values without prior knowledge of sign.

While the positive threshold is evaluated using ≥, the negative threshold uses the < operator, as illustrated in Figure 2. Uncorrected, this asymmetry allows the potential of a neuron to remain negative when it should otherwise be reset to zero, thus producing an incorrect number of output spikes, as depicted in Table IV. Despite the neuron and axon duplication that allows signed VMM, correct output can only be achieved with a feedback system that corrects this asymmetry [8]. Making the negative threshold behave symmetrically with the positive one removes the need for that feedback and yields a substantial resource reduction, which we discuss next.

To demonstrate the results, we map an 8×8 matrix, the largest that fits a single 256×256 core with feedback. Each column requires 8 neurons for the positive representation, 8 neurons for the negative representation, and 16 neurons for feedback. This is 32 neurons per column, multiplied by 8 columns, yielding 256 neurons. The 16 feedback neurons per column, multiplied by 8 columns, yield 128 feedback neurons, which necessitate 128 axons by which to connect. The maximum vector of 1×8 requires 1 axon per column for the positive and 1 axon per column for the negative representation. Duplicating this number to ensure correspondence between signed inputs and signed matrix values results in 4 axons per column and a total of 32 axons. Adding the 128 feedback axons required by the 8×8 matrix to the 32 input axons creates a 160-axon, 256-neuron core. Eliminating the feedback system leaves behind a 32-axon, 128-neuron core that solves the same VMM problem and reduces the number of neurons by 50%.

In order to validate our resource reduction analysis, we implement the signed VMM mapping method of Fair et al. [8] for the 1×8 vector and 8×8 matrix on the reference architecture that is emulated using the Zynq UltraScale+ MPSoC. We then implement the same VMM problem on the proposed architecture, which supports the symmetric threshold and eliminates the feedback loop. We show the resource usage of the functionally equivalent VMM mappings on the reference and proposed architectures in Table V. Elimination of the feedback loop removes half the dimensions of a standard core, which in turn reduces the necessary bit allotment for the scheduler and core SRAM. As shown in Table III, the standard 256×256
TABLE IV: The table depicts the tick-by-tick interaction between incoming spikes, connection weights, neuron potentials, and output spikes depending upon axon type. The symmetry of the reset thresholds affects the state of the neuron potential after successive ticks, with the asymmetric potential remaining negative until a positive value drives it back toward zero. In applications like VMM, the positive (+) and negative (-) representations rely on identical behavior of positive and negative potential resets to allow simultaneous positive and negative values to be calculated and represented by the neurons. Due to the configurability of our emulation environment, the feedback system used to correct the asymmetry is easily discarded.

                                 |                 Asymmetric                  |                  Symmetric
Tick  Axon  Spike  Weight(+,-)   |  Potential(+)  Potential(-)  Out(+)  Out(-) |  Potential(+)  Potential(-)  Out(+)  Out(-)
 1     0     1      1, -1        |       1            -1          1       0    |       1            -1          1       0
 2     X     0      X            |       0            -1          0       0    |       0             0          0       0
 3     1     1     -1,  1        |      -1             0          0       0    |      -1             1          0       1
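The tick-by-tick behavior in Table IV can be reproduced with a few lines of Python. This is an illustrative sketch of the reset rules only; the function, the unit threshold, and the tick encoding are our own assumptions, not part of the emulation environment. Potentials are recorded after the reset step, so the symmetric trace shows 0 where the table lists the pre-reset value.

```python
def run_neuron(ticks, symmetric, threshold=1):
    """Integrate-and-fire over a list of (spike, weight) ticks.

    The asymmetric variant tests the negative bound with a strict '<',
    so a potential sitting exactly at -threshold is never cleared; the
    symmetric variant mirrors the '>=' used on the positive side.
    Returns (post-reset potential trace, output spikes per tick)."""
    pot, trace, spikes = 0, [], []
    for spike, weight in ticks:
        pot += weight if spike else 0
        fired = 0
        if pot >= threshold:                        # positive threshold: >=
            fired, pot = 1, 0
        elif symmetric and pot <= -threshold:
            pot = 0                                 # mirrored reset, no spike
        elif not symmetric and pot < -threshold:
            pot = 0                                 # strict '<' misses pot == -threshold
        trace.append(pot)
        spikes.append(fired)
    return trace, spikes

# Negative-representation neuron from Table IV: weights -1, then +1.
ticks = [(1, -1), (0, 0), (1, 1)]
print(run_neuron(ticks, symmetric=False))  # ([-1, -1, 0], [0, 0, 0]) -- missing spike
print(run_neuron(ticks, symmetric=True))   # ([0, 0, 0], [0, 0, 1])  -- spike at tick 3
```

The asymmetric run never emits the negative output spike at tick 3, which is exactly the incorrect spike count the feedback system is meant to repair.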
TABLE V: Post-implementation resource utilization when running a 9-bit signed VMM implementation on a single core. The first row represents the core before architectural modification, while the second row represents the core after the neuron behavior modification and removal of the feedback system.

Design         LUT    LUT-RAM   FF     BRAM   Delay (ns)
Reference      1700   192       1210   4      10.266
Proposed       1165   48        1056   2      8.026
Reduction (%)  31.5   75.0      12.7   50.0   21.8
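The axon, neuron, and counter arithmetic behind Table V can be sketched as follows. The function names are our own illustration, not part of the emulation environment; they simply restate the bookkeeping from the text.

```python
from math import ceil, log2

def signed_vmm_core_shape(rows, cols, with_feedback=True):
    """Axon/neuron counts for mapping a signed rows x cols VMM onto one
    core, with duplicated positive/negative axons and neurons."""
    neurons = 2 * rows * cols        # one positive + one negative neuron per row, per column
    axons = 4 * cols                 # signed input axons paired with signed matrix values
    if with_feedback:
        feedback = 2 * rows * cols   # e.g. 16 feedback neurons per column for an 8x8 matrix
        neurons += feedback          # 128 + 128 = 256 neurons for the 8x8 case
        axons += feedback            # one feedback axon per feedback neuron
    return axons, neurons

def scheduler_counter_bits(num_axons):
    """Width of the scheduler counter needed to index every axon."""
    return ceil(log2(num_axons))

print(signed_vmm_core_shape(8, 8, with_feedback=True))   # (160, 256)
print(signed_vmm_core_shape(8, 8, with_feedback=False))  # (32, 128)
print(scheduler_counter_bits(256), scheduler_counter_bits(32))  # 8 5
```

The 160-axon, 256-neuron versus 32-axon, 128-neuron shapes match the reference and proposed rows of Table V, and the 8-bit versus 5-bit counter widths correspond to the scheduler simplification discussed in the text.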
Fig. 7: For the MNIST data set we modify our Output Core
to output all class votes as they are accumulated. For the first
six ticks of the data set, we generate the resulting votes in the
above histogram. For instances where a tie occurs, the Output
Core is set to select the first instance.
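The vote accumulation and first-instance tie-breaking described in the caption can be illustrated with a short sketch. This is our own illustration of the policy, not the Output Core's hardware implementation.

```python
def classify(votes_per_tick, num_classes=10):
    """Accumulate class votes tick by tick and pick the winner; on a
    tie, the first (lowest-index) class is selected, mirroring the
    Output Core's tie-breaking rule."""
    totals = [0] * num_classes
    for tick_votes in votes_per_tick:
        for cls in tick_votes:
            totals[cls] += 1
    # max() returns the first maximal element, i.e. the lowest class index.
    return max(range(num_classes), key=lambda c: totals[c])

# Classes 3 and 7 tie at two votes each; the lower index wins.
print(classify([[3, 7], [7, 3]]))  # 3
```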
core requires 5.5 BRAMs, which the core SRAM occupies. For the 8×8 VMM problem, based on the reference design with the feedback loop, we note that the smaller dimensions of 160×256 require 4 BRAMs. For the proposed symmetric-threshold based architecture, the 32×128 core requires 2 BRAMs, which confirms the expected 50% reduction. Instead of an 8-bit counter for the scheduler that counts up to 256, the proposed design requires only a 5-bit counter to index each of the 32 axons. This enables more efficient mapping of the scheduler to the LUT-RAMs, reducing total utilization by 75%. Additionally, removing the feedback system not only permits double the previous matrix capacity to occupy a single core, but also reduces the critical path delay by 21.8%, further bolstering throughput and scalability.

Fig. 6: Five-core network implementation with four 16×16 windows with a stride of 12, represented by blocks a-d. These generate a 256-input fan-in for the input layer of cores in our network. The input layer uses only 64 of the 256 possible neurons and outputs those to the classification core in the next layer.

E. MNIST Verification

In this experiment we implement a five core network that replicates the design introduced by Esser et al. [6], illustrated in Figure 6, while using the training methodology proposed by Yepes et al. [13]. Because the MNIST data set images have a size of 28×28 pixels, to ensure that our input layer of cores is fully connected we split the images into four 16×16 windows, with each window separated by a stride of 12 pixels. Each core of our input layer uses only 64 of its 256 available neurons, as each input core represents one quarter of the image. This image splitting method ensures each quarter is weighted evenly within the classification core. The classification core then uses only 250 of its 256 neurons, evenly distributing 25 neurons per class across the ten classes of the MNIST data set.

When running our implementation, the output core generates all votes for each class in MNIST as they are accumulated. This allows us to produce a histogram similar to Figure 7, which shows the number of votes for each digit across multiple ticks. By comparing against Compass, we verified that these histograms matched and that the correct digit was being selected. Our five core implementation achieves an accuracy of 96.28% on the MNIST data set, which is comparable to the accuracy achieved by Yepes et al. [13]. Our emulation environment takes 10 seconds to fully infer the 10,000 testing images of MNIST. An in-house serial implementation of the same emulation environment takes around 2 hours on an Intel Xeon processor (3 GHz, 32 GB RAM) to fully infer the dataset. We find that even with the symmetric threshold, our accuracy is unchanged.

IV. CONCLUSION

In this paper we present our approach to implementing an FPGA-based neuromorphic architecture emulation platform. We use IBM's TrueNorth as a reference and discuss our
hardware design decisions for each architectural component to make it feasible to implement on the FPGA. We conduct hardware resource usage analysis, validate the functionality of our emulation environment, and demonstrate its utility through case studies based on comparisons with respect to the published results. To the best of our knowledge, this is the first academic work on an FPGA-based emulation environment for simulating clusters of leaky-integrate-and-fire neuron models integrated with the principal router, scheduler, and memory management components. Unlike other approaches (e.g., [4], [10], [11], [17]) that are presented towards achieving large-scale spiking neural network simulations, the proposed open-source, parameterized, and modular emulation environment serves as a basis to conduct hardware architecture research for neuromorphic computing and to investigate the trade space between mapping strategies, hardware performance, and accuracy for the target applications.

Our FPGA-based emulation environment replaces the "globally asynchronous, locally synchronous" design with a fully synchronous design, as we focus on designing functionally correct synaptic cores and a basic leaky-integrate-and-fire neuron model. This allowed us to rapidly manipulate core components without needing to continually reconfigure our FPGA place-and-route tool chains to meet asynchronous timing requirements, and to investigate applications which have difficulty being mapped due to the architectural constraints, as we demonstrated with the case study on VMM requiring neuron copies to function correctly.

We believe there is room for reducing the resource usage for a more scalable emulation platform. We plan to optimize the BRAM usage by replacing the method of reading all 386 bits in a single clock cycle with a design that reads from the core SRAM in 72-bit bursts over multiple clock cycles, aligned with the 512x72 BRAM configuration. As future work we plan to build on our resource-efficient way of mapping the VMM and implement applications such as sparse matrix approximation and convolution. The ability to process convolution will in turn allow us to target a much broader class of image recognition tasks, such as Synthetic Aperture Radar (SAR) classification, which deals with more complex images than MNIST. Additionally, we will investigate the model accuracy challenges of a neuromorphic system while maintaining its energy-efficient execution flow by studying the correlation between training methods, accuracy, and architecture configuration parameters.

V. ACKNOWLEDGEMENTS

Research reported in this publication was supported in part by Raytheon Missile Systems under the contract 2017-UNI-0008. The content is solely the responsibility of the authors and does not necessarily represent the official views of Raytheon Missile Systems.

REFERENCES

[1] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. J. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, Oct 2015.
[2] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha. Building block of a programmable neuromorphic substrate: A digital neurosynaptic core. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8, June 2012.
[3] A. S. Cassidy, P. Merolla, J. V. Arthur, S. K. Esser, B. Jackson, R. Alvarez-Icaza, P. Datta, J. Sawada, T. M. Wong, V. Feldman, A. Amir, D. B. D. Rubin, F. Akopyan, E. McQuinn, W. P. Risk, and D. S. Modha. Cognitive computing building block: A versatile and efficient digital neuron model for neurosynaptic cores. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–10, Aug 2013.
[4] T. Chou, H. J. Kashyap, J. Xing, S. Listopad, E. L. Rounds, M. Beyeler, N. Dutt, and J. L. Krichmar. CARLsim 4: An open source library for large scale, biologically detailed spiking neural network simulation using heterogeneous clusters. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2018.
[5] M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1), Jan. 2018.
[6] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha. Backpropagation for energy-efficient neuromorphic computing. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1117–1125. Curran Associates, Inc., 2015.
[7] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446, 2016.
[8] K. L. Fair, D. R. Mendat, A. G. Andreou, C. J. Rozell, J. Romberg, and D. V. Anderson. Sparse coding using the locally competitive algorithm on the TrueNorth neurosynaptic system. Frontiers in Neuroscience, 13:754, 2019.
[9] K. D. Fischl, A. G. Andreou, T. C. Stewart, and K. Fair. Implementation of the neural engineering framework on the TrueNorth neurosynaptic system. In IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 1–4, Oct 2018.
[10] M. L. Hines and N. T. Carnevale. The NEURON simulation environment. Neural Computation, 9(6):1179–1209, 1997.
[11] R. Hoang, D. Tanna, L. Jayet Bray, S. Dascalu, and F. Harris. A novel CPU/GPU simulation environment for large-scale biologically realistic neural modeling. Frontiers in Neuroinformatics, 7:19, 2013.
[12] N. Imam, K. Wecker, J. Tse, R. Karmazin, and R. Manohar. Neural spiking dynamics in asynchronous digital circuits. In 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8, Aug 2013.
[13] A. Jimeno-Yepes, J. Tang, and B. S. Mashford. Improving classification accuracy of feedforward neural networks for spiking neuromorphic chips. In IJCAI, 2017.
[14] D. R. Mendat, A. S. Cassidy, G. Zarrella, and A. G. Andreou. Word2vec word similarities on IBM's TrueNorth neurosynaptic system. In Biomedical Circuits and Systems Conference (BioCAS), pages 1–4, Oct 2018.
[15] R. Preissl, T. M. Wong, P. Datta, M. Flickner, R. Singh, S. K. Esser, W. P. Risk, H. D. Simon, and D. S. Modha. Compass: A scalable simulator for an architecture for cognitive computing. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11, Nov 2012.
[16] S. Schmitt, J. Klähn, G. Bellec, A. Grübl, M. Güttler, A. Hartel, S. Hartmann, D. Husmann, K. Husmann, S. Jeltsch, V. Karasenko, M. Kleider, C. Koke, A. Kononov, C. Mauch, E. Müller, P. Müller, J. Partzsch, M. A. Petrovici, S. Schiefer, S. Scholze, V. Thanasoulis, B. Vogginger, R. Legenstein, W. Maass, C. Mayr, R. Schüffny, J. Schemmel, and K. Meier. Neuromorphic hardware in the loop: Training a deep spiking network on the BrainScaleS wafer-scale system. In 2017 Int. Joint Conference on Neural Networks (IJCNN), pages 2227–2234, May 2017.
[17] E. Yavuz, J. Turner, and T. Nowotny. GeNN: a code generation framework for accelerated brain simulations. Scientific Reports, volume 6, January 2016.